Comments on IL Simultaneous v. Sequential Lineup Field Test

 

Ebbe B. Ebbesen

University of California, San Diego

May 2, 2006

 

The following are some comments, additional data analyses, and theoretical results concerning the IL Pilot Project on Sequential Double Blind Eyewitness Procedures. These comments are in addition to the data analyses and conclusions reached in the report and its appendix, both of which are already in the public domain (see http://www.chicagopolice.org/IL%20Pilot%20on%20Eyewitness%20ID.pdf for the main report and http://www.chicagopolice.org/Apndx%20to%20IL%20Pilot%20on%20Eyewitness%20ID.pdf for the appendix to the report; the latter includes the main data analyses that our laboratory conducted).

 

Is the experiment flawed?

 

Some have suggested (e.g., Wells) that the experimental design used in the pilot project is flawed because it compared eyewitness performance from simultaneous lineups without blind administrators with performance from sequential lineups with blind administrators. There are a number of reasons why this criticism is irrelevant and should not undermine the conclusions from the study.

 

  1. As with virtually all program evaluation research in which one proposes to replace a standard practice with a new procedure (one that some claim is an improvement over the old), the standard program-evaluation design is old (control) v. new (experimental).
    1. Often when research like this is done, we know that the old and the new differ in many ways but treat the new, experimental approach as a package.
      i. This is because those who are promoting the new approach often claim that every aspect of their new package is critical.
        1. In the present case, advocates of the sequential procedure were suggesting that an important component of the improvements that the new procedure would bring comes from the fact that investigators running the lineups would not know who the suspect was.
        2. A corollary of this suggestion is that the standard practice suffers both because it presents items simultaneously (and therefore allows relative decisions rather than absolute ones) and because lineup administrators can influence (mostly by displaying unconscious behavioral cues) witnesses to choose the suspect.
        3. Notice that the recommendation for the new procedure assumes that BOTH blinding of the lineup administrator AND the sequential presentation are necessary to see significant improvement in eyewitness performance.
      ii. As a result, we can’t make the new package simpler so that the old and new differ in only one respect (e.g., sequential without blind to compare to simultaneous without blind). Had we done this, critics would have complained the package was incomplete and didn’t provide a fair test of the suggested reform.
      iii. Similarly, we cannot change the “standard” method by adding features that are claimed to be an essential part of the new process so that the experiment examines just the effect of one aspect of the new procedure, e.g., blind simultaneous with blind sequential. Were we to do this, we would not know how well the old compares to the new because the old, standard procedure was never assessed.

    2. In medical research, for example, the new procedure might be the use of three new experimental drugs, all in combination, compared to the old method of just one “standard” drug. We do not generally design the evaluation research to run all possible combinations, a, b, c, a+b, a+c, b+c, and a+b+c until we know that the recommended package, abc (all of them together), works better than the old way. We simply compare the old to abc.
      i. Notice that a critic of the results could always say, but the reason the results turned out the way they did is because “a” was included.
      ii. That is, the change from old to new was due to “a” and not to a+b+c. We wouldn’t know after the results were obtained.
      iii. What we do know is that the package of a+b+c produces a different or the same result as the old procedure.

    3. Similarly, in the present case, we know from the results of the IL project that the “best” sequential+blind is NOT better than the “standard”, old way (simultaneous as it is always done without a blind administrator).
    4. These results are contrary to the initial expectations of those who recommended that the new package be tried. Analysis of laboratory results comparing simultaneous and sequential lineups (Ebbesen and Flowe, 2003) led those willing to generalize from these results to expect that both foil choices and suspect choices should have been lower in the sequential+blind than in the simultaneous standard procedure. Obviously, the results from the IL project are inconsistent with this expectation.
      i. It is worth noting that the failure of the field results to confirm a generalization based on laboratory research is consistent with a position that I and a few colleagues (e.g., Konecni, Yuille, Egeth) have been making since the mid 1980s.
      ii. In Ebbesen and Konecni, 1989 we argued that many of the conclusions reached by those testifying in court about eyewitness identification could not be fairly generalized because the research methods and results were incomplete. In particular, the studies were typically conducted in laboratory settings in which the motivations of the subjects, the viewing conditions, the selection of subjects, and the methods used to measure memory might not simulate the conditions typical of actual crimes. Appropriate field trials were not done and, as a result, key processes (e.g., the fact that witnesses, prosecutors, and investigators select witnesses and cases in the legal system) were not being duplicated in the laboratory.

  2. One does not typically design program-evaluation research to discover which of the many differences between the old and new might be effective until we know whether the new is better than the old.
  3. This project was NOT designed to answer specific theoretical questions.
    1. For example, even if we did use the double-blind method in both the simultaneous and in the sequential lineup procedures, there would have been a host of differences between the two procedures that a critic could have pointed to if the results did not turn out as the critic had hoped.
      i. The blinding procedure itself might have been different for simultaneous and sequential. It is likely that a different number of investigators are required for the two procedures and as a result, the protocols would necessarily have been different.
      ii. The sequential procedure could have been varied in a host of different ways and a critic could have “explained away” results by saying that we should have run the sequential procedure differently than the way we chose to do it.
        1. Critics could argue that we should have given different instructions to witnesses and/or to the investigators. They could have claimed that our instructions were not what they should have been (despite the fact that there are no specifics regarding the precise wording of instructions that should be used, e.g., whether the witness will be allowed to see all items even after they select one, whether they will be allowed to look at all of them together after looking at each one alone, whether they should withhold their decision until they have seen all of them as in the procedure that was used in the Steblay study in MN, a procedure that has never been studied in laboratory simulation research), and so on.
        2. Critics could argue that we ran our sequential protocol incorrectly because we did not tell witnesses that they would be looking at a large number of alternatives rather than just the number in the lineup. This was originally thought to be an essential component of the sequential procedure because it would prevent witnesses from thinking that the last item in the lineup was their last chance to pick someone.
        3. Critics could argue that the results were due to the method used to select fillers.
        4. Etc.
    2. We did not propose to test whether the relative decision or some other process was at work, nor did we attempt to test the impact of one narrow, specific feature of the procedure. This was clearly applied research to determine which of two packages was better: the simultaneous lineup procedure as it was typically being done in IL or the new sequential procedure as recommended by some vocal critics of the status quo.
  4. Even if we had run blind+simultaneous and blind+sequential and found similar results to those described in the report, critics would have said, “Of course, blinding is sufficient to eliminate the problems with simultaneous”. Damned if we do and damned if we don’t.
  5. It is of interest to note that critics of the IL study never expressed any concerns about Steblay’s Hennepin, MN study despite the fact that this study did not have a control condition. That is, it was NOT an experiment. As a result Steblay was initially forced to compare results from the field trial to laboratory data collected using protocols and procedures that differed from the field trial in many more ways than the two conditions that the IL study compared. More recently, Steblay is apparently collecting results from simultaneous lineups run prior to changing to the sequential procedure. It will be interesting to see if critics of the present study note that this comparison lacks an essential aspect needed to conclude that any differences were due to specific procedural changes, namely, randomization.
    1. It is also of interest that Steblay’s comparison will necessarily be similar to the one made in this study, namely, simultaneous without blind administrators versus (a type of) sequential with blind administrators. It is odd that critics of the present study have not suggested that any results from the new analyses would be worthless because they lacked a blind simultaneous comparison group.
    2. Furthermore, the protocol for the sequential procedure used by Steblay makes it very difficult to know how to compare results from it to other sequential procedures, since Steblay’s protocol allowed witnesses to view all pictures before choosing one, a procedure that seems very similar to a simultaneous lineup.
  6. Some critics have routinely claimed that other jurisdictions have shown that sequential lineups are successful (e.g., NJ, NC, Santa Clara County, CA). However, not only did these pilot programs not have any control group, they also did not have an experimental group in which eyewitness performance outcomes were measured. Instead, success is defined in terms of “vague subjective impressions” collected either unsystematically or systematically from those who were running the lineups. Thus, the IL project is the first study that compares, in a randomized experimental design, outcomes that actually measure witness performance in a standard lineup procedure with witness performance in a new, experimental, lineup procedure. This is a major step forward in evaluation protocol.
  7. Much laboratory research in this field leaves out important conditions that would help us evaluate whether results can be generalized to the real world.
    1. Consider the worry that investigator feedback after the ID will cause witnesses to become more confident in their ID. This research typically only examines the effect of feedback on responses to lineups without the guilty culprit present.
      i. In research on feedback effects, witnesses who correctly ID the suspect are never provided with accurate feedback. Only witnesses who have viewed blank lineups are studied.
      ii. In addition, the videotapes of simulated crimes shown to college students in these studies generally produce no or very weak memory for the culprit (witnesses view a grainy videotape of a culprit in which they barely see his face for a few seconds while he is running around). This research never examines results from witnesses who may have gotten a good look at the suspect and formed a strong memory of his appearance.
    2. When results from laboratory studies are analyzed, researchers almost never discuss the effects of the factor (stress, duration of exposure, other-race ID, etc.) on just those witnesses who would be likely to be involved in a real case (e.g., those with high confidence in their ability to ID). Witnesses who say before the fact that they never saw the culprit, do not want to take part, and so on, and witnesses who say after the fact that they were just guessing, are not used to build a case and won’t appear in court. But these witnesses constitute the majority of those that produce laboratory simulation data for many of the conclusions that are based on laboratory studies.
  8. Some critics of this study have proposed some specific protocol changes that have NO basis in research (at this time!).
    1. For example, Wells supported the idea that if simultaneous lineups are run (instead of sequential procedures), the lineups should contain nine people. This recommendation was made to the California Commission on the Fair Administration of Justice in April 2006.
      i. When asked why nine during discussion after a presentation to the committee, Wells’s response was that nine is a good number. It is not too large and not too small. He said that the law of large numbers, for which there is considerable support both in psychology and economics, suggests that differences between large numbers are psychologically smaller than differences between smaller numbers. Therefore, 3-4 would be too little and 15 would be too many. So nine is about right.
    2. Recommendations for recording confidence immediately after an ID seem like a reasonable idea on the surface, but the consequences of this for real world investigations and case outcomes have not been studied. No research has been done on how best to record confidence, for example.
  9. Gary Wells, a major critic of this study, has endorsed a recommendation for changes in eyewitness evidence protocol that would involve a huge number of “confounding” variables. In CA, for example, he recently testified that over 10 different changes in procedure should be made all at one time. Were before-after data collected to see the effect of these changes, we could not possibly know which of them was responsible for any observed changes in eyewitness behavior. Furthermore, it is worth noting that Dr. Wells’ recommendations made to the California Commission on the Fair Administration of Justice did not include a recommendation that these changes in protocol be accompanied by randomized experimental procedures in which witness performance measures would be collected. It seems odd that some critics would be critical of the design of the present study when the same critics fail to require any experimental field tests of the protocol changes that they recommend.

 

Administrator influence as an explanation for the findings

 

One alternative explanation for the findings, favored by critics of the IL experiment, is that the failure to use blind administrators in the simultaneous lineup allowed the administrators to “influence” which item the witnesses chose from the lineups. Such influence would cause the witnesses to choose more suspects and fewer foils. The administrator would (mostly unconsciously) “steer” the witness to what the administrators believed was the correct choice, namely, the suspect. This position assumes that administrators “leak” behavioral cues (or if the behavior is conscious, that the administrator simply decides to do or say something informative) that are diagnostic of their knowledge about who is and is not the suspect. That is, the cues contain information that point to the correct suspect. This explanation also assumes that the witnesses (consciously or unconsciously) attend to these cues in a way that allows them to detect the information about who is and is not the suspect. This explanation assumes that once they notice the relevant cues, the witnesses will tend to alter their decisions in the direction of the suspect and away from fillers.

 

A corollary of this explanation assumes that the extra information (behavioral cues provided by the administrator) about who the suspect is will also supply some type of consensual validation for the witnesses’ choices. If witnesses pick the same person that they believe the administrator “knows” is the suspect, the witnesses should feel more confident in their decisions. After all, they are agreeing with a person who knows who committed the crime. If, on the other hand, witnesses choose a filler, the fact that they are disagreeing with someone who knows who the guilty person is should lower their confidence.

                                                                        

My position is not that critics are wrong in their explanation, but that they have no idea whether it is right. In addition, the critics have presented no evidence to support this explanation over other equally, if not more, reasonable explanations. The following are tests of the validity of this explanation using results from the IL study. These tests are designed to determine whether implications of the “administrator influence hypothesis” are consistent with results from the IL experiment.

 

  1. Memory Strength and the Administrator Influence Hypothesis.

 

If the “administrator influence” explanation for the higher filler choice rate in sequential than in simultaneous lineups is correct, one might expect investigator influence to increase as witnesses’ memories of the culprit become weaker. The weaker the memory of the culprit, the more the witness might look to other sources of information about who to pick. Alternatively, the stronger the memory, the less the witness might be swayed by someone else. While we don’t have a direct measure of memory strength, we can infer that various conditions of exposure and testing might increase or decrease the witnesses’ memory of the culprit. For example, we could predict that as the duration of exposure to the culprit increases, memory for the culprit also increases (in general). Alternatively, memory for culprits who are the same race as the witness might be stronger than memory for culprits who are of a different race than the witness. While there was not enough information in the files to assess duration of exposure, we were able to code the race of the suspect and the race of the witness for almost all of the ID attempts. If it is true that administrators influence witnesses by leaking cues, we might expect this influence to be strongest when the witnesses are less sure about which person is actually the suspect, namely, when other-race identifications are being made (and they “all look alike”). Table 1 shows the rate of suspect choices and foil (filler) choices (one can compute the no choice percentage by subtracting these from 100%) as a function of the type of lineup (photo v. physical or live), lineup procedure (simultaneous v. sequential) and whether the witness and suspect (and therefore fillers) were of the same or a different race.

 

If this form of the administrator hypothesis is correct, we should expect to see more suspect and fewer filler choices for other-race IDs than for same-race IDs in simultaneous but not in sequential lineups (because the administrators were blind in the latter procedure). The results in Table 1 are inconsistent with this prediction. First, we can see that the suspect choice rates were much lower for other-race IDs than same-race IDs in all but the simultaneous, physical lineups. Second, foil choice rates were not higher when the witness and suspect were in different racial categories.[1] Third, the difference between same and other race suspect choice rates should be smaller for simultaneous than sequential lineups (because more witnesses in the simultaneous lineup who would otherwise not choose the other-race suspect were induced to do so). For photo lineups the same-other race difference was (53.7-28 =) 25.7% for sequential lineups and it was (64-26.5 =) 37.5% for simultaneous lineups. Thus, the difference was actually larger, not smaller, in simultaneous lineups. For physical lineups, on the other hand, the sequential lineups same-other race difference was (50-21.2 =) 28.8% and the simultaneous difference was (80.6-83.3 =) -2.7%. Thus, the data for live lineups appear inconsistent with the results from photo lineups. The opposite prediction should apply to foil choices. That is, the rate of foil choices should be that much lower in simultaneous lineups when other-race IDs are being made than when same-race IDs are being made. For photo lineups the same-other foil choice rate difference was (9.8-8 =) 1.8% for sequential lineups and it was (0-2.9 =) -2.9% for simultaneous lineups. But for live lineups, the difference was (13.2-0 =) 13.2% for sequential lineups and (0-0 =) 0% for simultaneous lineups. Again, the results seem inconsistent. At this point, we do not know why these inconsistencies exist. Nevertheless, it is clear that the pattern predicted from the administrator leakage/influence hypothesis does not appear to be supported.

 

Table 1. Frequency of Suspect and Filler Choices as a Function of Lineup Type and Lineup Procedure and Racial Similarity of the Witness and Suspect.

Lineup Type   Lineup Procedure   Racial Similarity (a)   % Suspect Choices   % Foil Choices
Photo         Simultaneous       Other                   26.5                2.9
Photo         Simultaneous       Same                    64.0                0.0
Photo         Sequential         Other                   28.0                8.0
Photo         Sequential         Same                    53.7                9.8
Physical      Simultaneous       Other                   83.3                0.0
Physical      Simultaneous       Same                    80.6                0.0
Physical      Sequential         Other                   21.2                0.0
Physical      Sequential         Same                    50.0                13.2

a. This refers to whether the witness and the suspect (and therefore all of the fillers) were in the same or different racial categories.
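
The same-minus-other differences discussed before Table 1 can be recomputed directly from the table. The following is only a minimal sketch: the dictionary layout and the function name `same_minus_other` are illustrative choices, not part of the original analysis, and the figures are simply the percentages printed in Table 1.

```python
# Percentages copied from Table 1:
# (lineup type, procedure, racial similarity) -> (% suspect choices, % foil choices)
TABLE_1 = {
    ("photo", "simultaneous", "other"): (26.5, 2.9),
    ("photo", "simultaneous", "same"): (64.0, 0.0),
    ("photo", "sequential", "other"): (28.0, 8.0),
    ("photo", "sequential", "same"): (53.7, 9.8),
    ("physical", "simultaneous", "other"): (83.3, 0.0),
    ("physical", "simultaneous", "same"): (80.6, 0.0),
    ("physical", "sequential", "other"): (21.2, 0.0),
    ("physical", "sequential", "same"): (50.0, 13.2),
}


def same_minus_other(lineup_type, procedure):
    """Return (suspect difference, foil difference) in percentage points."""
    same = TABLE_1[(lineup_type, procedure, "same")]
    other = TABLE_1[(lineup_type, procedure, "other")]
    return (round(same[0] - other[0], 1), round(same[1] - other[1], 1))


if __name__ == "__main__":
    for lineup_type in ("photo", "physical"):
        for procedure in ("sequential", "simultaneous"):
            suspect_diff, foil_diff = same_minus_other(lineup_type, procedure)
            print(f"{lineup_type:8s} {procedure:12s} "
                  f"suspect {suspect_diff:+.1f}  foil {foil_diff:+.1f}")
```

Under the administrator influence hypothesis, the suspect difference would be expected to be smaller for simultaneous than for sequential lineups within each lineup type; the printed values allow that comparison to be read off row by row.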

 

  2. Social Context and the Administrator Influence Hypothesis

 

If the administrator influence explanation is correct, one might expect such influence to be less when the surrounding context makes such influence more difficult or less likely because of the presence of others, e.g., with a physical lineup rather than with a photo lineup. It seems reasonable to assume that multiple investigators and prosecutors are more likely to be present at live than photo lineups. As a result, investigators might not be as well positioned physically for their behavioral cues to be monitored by witnesses (assuming that they give off such cues on a regular basis). In addition, the layout of the rooms used in live lineups will generally not place the administrator directly in front of the witnesses, as seems more likely in the case of photo lineups. Results are inconsistent with this view, however. Table 2 presents the suspect and foil choice rates as a function of lineup type and lineup procedure. Examination of the results in Table 2 shows that, if anything, suspect choice rates were higher in physical lineups than in photo lineups. This means that no choice rates were highest when they should have been lowest (simultaneous photo lineups).

 

Table 2. Frequency of Suspect and Filler Choices as a Function of Lineup Type and Lineup Procedure.

Lineup Procedure   Lineup Type   % Suspect Choices   % Foil Choices   % No Choices
Simultaneous       Photo         52.6                1.3              46.1
Sequential         Photo         43.8                9.4              46.8
Simultaneous       Physical      81.8                0.0              18.2
Sequential         Physical      46.4                5.4              41.0
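
Because every witness either chose the suspect, chose a filler, or made no choice, the three percentages in each row of Table 2 should sum to 100. The sketch below checks this for the figures as they are printed here; the names `TABLE_2`, `row_total`, and `rows_not_summing_to_100`, and the 0.2-point tolerance for rounding, are assumptions made for illustration.

```python
# Percentages as printed in Table 2:
# (procedure, lineup type) -> (% suspect, % foil, % no choice)
TABLE_2 = {
    ("simultaneous", "photo"): (52.6, 1.3, 46.1),
    ("sequential", "photo"): (43.8, 9.4, 46.8),
    ("simultaneous", "physical"): (81.8, 0.0, 18.2),
    ("sequential", "physical"): (46.4, 5.4, 41.0),
}


def row_total(procedure, lineup_type):
    """Sum of the three choice percentages for one row, rounded to one decimal."""
    return round(sum(TABLE_2[(procedure, lineup_type)]), 1)


def rows_not_summing_to_100(tolerance=0.2):
    """List the rows whose percentages miss 100 by more than the rounding tolerance."""
    return [key for key in TABLE_2 if abs(row_total(*key) - 100.0) > tolerance]


if __name__ == "__main__":
    for key in TABLE_2:
        print(key, row_total(*key))
    print("rows to double-check:", rows_not_summing_to_100())
```

As printed, the sequential physical row sums to less than 100. Which of its three figures is affected cannot be determined from this document, so that row is best read with some caution.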

 

  3. Witness Confidence and the Administrator Influence Hypothesis

 

If, as critics suggest, investigators in the simultaneous procedure were “consciously or unconsciously” suggesting which of the alternatives to pick (because they were not blind), we might expect the witnesses to be much more confident in their choices in the simultaneous lineup procedure than in the sequential procedure. After all, in the simultaneous procedure, the witnesses’ choices would be “reinforced” by the pre- and/or post-identification responses of the investigators who knew which alternative was the suspect. “Good, you picked our suspect” might be a response provided by those who were not blind, or suggestions might be made prior to the choice as to who the suspect was (“We all know it is number 3.”).

 

Table 4a shows the number of witnesses who expressed “high”, “moderate”, and “low” confidence broken down by lineup procedure, and Table 4b shows the same results as percentages within each lineup procedure.

 

Table 4a. Number of Witnesses Viewing Simultaneous and Sequential Lineups Who Expressed High, Moderate, and Low Confidence in Their Responses (a)

                                     Confidence
Lineup Procedure           High   Moderate   Low   Total
Simultaneous (Not Blind)   101    18         10    129
Sequential (Blind)         158    27         20    205
Total                      259    45         30    334

a. These results are for the subset of witness ID attempts for which confidence estimates were available.

                                           


Table 4b. Percent of Witnesses Viewing Simultaneous and Sequential Lineups Who Expressed High, Moderate, and Low Confidence in Their Responses

                                     Confidence
Lineup Procedure           High    Moderate   Low    Total (N)
Simultaneous (Not Blind)   78.29   13.95      7.75   129
Sequential (Blind)         77.07   13.17      9.76   205
Total (N)                  259     45         30     334

 

The results in Tables 4a and b are inconsistent with the idea that investigators influenced the witnesses’ choices to an extent that made them feel more confident in those choices. We can see that the percentage of high confident witnesses was virtually identical for the two lineup procedures despite the fact that those who administered the simultaneous lineups knew who the suspect was.
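
The percentages in Table 4b follow directly from the counts in Table 4a, and the similarity of the two confidence distributions can also be checked with a Pearson chi-square test of independence. The sketch below is an illustrative check, not an analysis taken from the report: the function names are hypothetical, and the p-value shortcut relies on the fact that a chi-square variable with exactly 2 degrees of freedom has upper-tail probability exp(-x/2), which is the case for a 2 x 3 table.

```python
import math

# Counts copied from Table 4a: procedure -> (high, moderate, low) confidence.
COUNTS = {
    "simultaneous": (101, 18, 10),
    "sequential": (158, 27, 20),
}


def percent_rows(counts):
    """Convert each row of counts to percentages of that row's total."""
    return {
        name: tuple(round(100 * c / sum(row), 2) for c in row)
        for name, row in counts.items()
    }


def chi_square_independence(counts):
    """Return (Pearson chi-square statistic, degrees of freedom) for a count table."""
    rows = list(counts.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


def p_value_two_dof(statistic):
    """Upper-tail probability for a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


if __name__ == "__main__":
    print(percent_rows(COUNTS))
    stat, dof = chi_square_independence(COUNTS)
    print(f"chi-square = {stat:.3f}, df = {dof}, p = {p_value_two_dof(stat):.3f}")
```

A small statistic and a large p-value would be in line with the statement that the confidence distributions were virtually identical across the two procedures.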

 

We can make the above argument even stronger by noting that if the administrator was leaking cues to pick the suspect (and not the fillers) during the simultaneous lineups, only those witnesses who picked the suspect would have consensual validation of their choices. Those who picked the fillers would actually be disagreeing with the administrator’s influence attempt. This reasoning predicts that the witnesses viewing the simultaneous lineup who chose the suspect should be more confident in those choices than witnesses who chose the suspect from a sequential lineup. In contrast, those who chose the fillers from a simultaneous lineup should be less confident than those who chose fillers from a sequential lineup. We analyzed the percent of witnesses who expressed high confidence separately for those who chose the suspect (this paragraph) and for those who chose fillers (next paragraph). For the simultaneous lineup, 69 out of 87 (or 79.3% of the) witnesses who chose the suspect did so with high confidence. For sequential lineups, 118 out of 140 (or 84.3% of the) witnesses who chose the suspect did so with high confidence. Thus, if anything, witnesses were more likely to be confident in their suspect choices in sequential/blind lineups than in simultaneous lineups.

 

When the filler choices were examined, 66.7% of the filler choices made to simultaneous lineups and 21.4% made to sequential lineups were made with high confidence. While the Ns are small, if anything, the trend is opposite to the investigator influence explanation for the results. Thus, those who chose a filler from a simultaneous lineup were more confident even though their choices should have disagreed with the influence attempts of the administrator (assuming they existed).
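
The high-confidence shares for suspect and filler choosers can be reproduced from the choice counts by confidence level that appear in Table 5. This is a minimal sketch; the dictionary name `CHOICES_BY_CONFIDENCE` and the function `high_confidence_share` are hypothetical, and only witnesses whose confidence was coded are included.

```python
# Counts from Table 5 for witnesses with coded confidence:
# procedure -> choice -> (high, moderate, low).
CHOICES_BY_CONFIDENCE = {
    "simultaneous": {"suspect": (69, 14, 4), "filler": (4, 2, 0)},
    "sequential": {"suspect": (118, 19, 3), "filler": (3, 7, 4)},
}


def high_confidence_share(procedure, choice):
    """Percent of the given choices that were made with high confidence."""
    high, moderate, low = CHOICES_BY_CONFIDENCE[procedure][choice]
    return round(100 * high / (high + moderate + low), 1)


if __name__ == "__main__":
    for procedure in CHOICES_BY_CONFIDENCE:
        for choice in ("suspect", "filler"):
            share = high_confidence_share(procedure, choice)
            print(f"{procedure:12s} {choice:7s} {share:.1f}% high confidence")
```

The filler rows rest on very few choices (6 simultaneous and 14 sequential), so those two percentages are best read as a direction of trend rather than as stable estimates.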

 

One might argue that this analysis is flawed because the coding of confidence is invalid. Garbage in, garbage out. To determine whether the results from this study support this view, we examined the extent to which self-reported confidence predicted which witnesses would make mistakes. Although many psychologists who have testified for the defense in criminal cases argue that witness confidence is a poor predictor of witness accuracy, I have argued that this claim is false when the relationship is correctly measured. If I am correct that there is some relationship between confidence and accuracy among actual witnesses to real crimes and the confidence coding that we did was valid, we might expect less confident witnesses to be more likely to make mistakes.[2] Table 5 shows the choice results for each procedure (for all witness/suspect lineups) broken down into low (“I think that’s him, but I can’t be positive.”, “He looks like the guy, but I’m not positive.”, “#1 could have been the passenger.”, “Only 45% sure.”), moderate (“Yes, that looks like the guy.”, “Looks like him, he was husky like that.”, “80-90% sure”), and high (“That’s him. I’m certain.”, “100% sure.”, “100% absolutely positive.”, “I’m positive that’s the one that shot me.”, “Yep, that’s him. I’m sure, 200%”) confidence. It presents the suspect and filler choice rates for both simultaneous and sequential lineups broken down into subcategories according to the level of confidence that witnesses expressed in their choices.[3] We can use these results to determine the percentage of identifications (suspect and filler) that were known errors.

                                               

Table 5.

Procedure      Confidence   Number of Suspect Choices   Number of Filler Choices   Number of No Choices   % Suspect   % Filler
Simultaneous   NA           177                         2                          109                    61.46       0.69
Sequential     NA           29                          10                         76                     25.22       8.70
Simultaneous   High         69                          4                          28                     68.32       3.96
Simultaneous   Moderate     14                          2                          1                      82.35       11.76
Simultaneous   Low          4                           0                          6                      40.00       0.00
Sequential     High         118                         3                          37                     74.68       1.90
Sequential     Moderate     19                          7                          1                      70.37       25.93
Sequential     Low          3                           4                          13                     15.00       20.00

 

The results in Table 5 show that (with the exception of the lowest confidence choices in the simultaneous procedure with an N = 4), the likelihood that a positive ID of someone would be a filler increased as the confidence that witnesses expressed in their choices decreased. Thus, filler choices accompanied by expressions of high confidence occurred in about 4% of the simultaneous and 2% of the sequential lineups. On the other hand, filler choices accompanied by expressions of less than high confidence (moderate plus low) occurred in about 7.4% of the simultaneous and about 23.4% of the sequential lineups.
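
The pooled figures in the preceding paragraph (high confidence versus moderate plus low) can be recomputed from the counts in Table 5. The sketch below assumes, as the paragraph does, that the denominator is all lineups at the given confidence levels, including no-choice outcomes; `TABLE_5` and `filler_rate` are illustrative names.

```python
# Counts from Table 5 for witnesses with coded confidence:
# procedure -> confidence -> (suspect choices, filler choices, no choices).
TABLE_5 = {
    "simultaneous": {"high": (69, 4, 28), "moderate": (14, 2, 1), "low": (4, 0, 6)},
    "sequential": {"high": (118, 3, 37), "moderate": (19, 7, 1), "low": (3, 4, 13)},
}


def filler_rate(procedure, levels):
    """Percent of lineups at the given confidence levels that ended in a filler choice."""
    fillers = sum(TABLE_5[procedure][level][1] for level in levels)
    lineups = sum(sum(TABLE_5[procedure][level]) for level in levels)
    return round(100 * fillers / lineups, 1)


if __name__ == "__main__":
    for procedure in TABLE_5:
        high = filler_rate(procedure, ["high"])
        less_than_high = filler_rate(procedure, ["moderate", "low"])
        print(f"{procedure:12s} high: {high:.1f}%  moderate+low: {less_than_high:.1f}%")
```

Pooling moderate and low confidence keeps the denominators from becoming too small to interpret, at the cost of hiding the difference between those two levels.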

 

We can analyze the data in Table 5 slightly differently. We can ask what percent of the positive identifications made by the witnesses were of suspects as opposed to the fillers at each confidence level. Table 6 shows these results. We can see that, in general, the higher the confidence level, the more witnesses tended to identify suspects rather than known innocents (fillers). These results tend to validate the confidence analyses presented earlier and strengthen the conclusion that administrators were not influencing witnesses more in simultaneous than sequential lineups.

 

Table 6. Percent of Positive IDs that were Filler Choices as a Function of Lineup Procedure and Category of the Witness’s Expressed Confidence in the ID

Procedure                 Confidence    % of Positive IDs that were a (Known) Error
Simultaneous (N = 178)    NA             1.12
Sequential (N = 39)       NA            25.64
Simultaneous (N = 73)     High           5.48
Simultaneous (N = 16)     Moderate      12.50
Simultaneous (N = 4)      Low            0.00
Sequential (N = 121)      High           2.48
Sequential (N = 26)       Moderate      26.92
Sequential (N = 7)        Low           57.14
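The percentages in Table 6 follow directly from the suspect and filler counts in Table 5: for each row, the filler count is divided by the number of positive IDs (suspects plus fillers). A minimal sketch of that calculation is given here; the names `positive_ids` and `filler_share` are hypothetical and used only for illustration.

```python
# (suspect choices, filler choices) copied from Table 5.
positive_ids = {
    ("Simultaneous", "NA"): (177, 2),
    ("Sequential", "NA"): (29, 10),
    ("Simultaneous", "High"): (69, 4),
    ("Simultaneous", "Moderate"): (14, 2),
    ("Simultaneous", "Low"): (4, 0),
    ("Sequential", "High"): (118, 3),
    ("Sequential", "Moderate"): (19, 7),
    ("Sequential", "Low"): (3, 4),
}


def filler_share(suspects, fillers):
    """Percent of positive IDs (suspect or filler picks) that were filler picks."""
    return 100.0 * fillers / (suspects + fillers)


for (procedure, confidence), (suspects, fillers) in positive_ids.items():
    print(procedure, confidence, round(filler_share(suspects, fillers), 2))
```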

 

  1. Status of Witness and the Administrator Influence Hypothesis

 

The person making the identification could have been a victim of the criminal act or simply a witness to the action. We analyzed whether the status of the witness had any effect on choice rates. Table 7 shows the choice rates for all lineups as a function of the status of the witness. When all of the lineups for which the information was available are examined, there was essentially no effect of witness status on choice rates.

 

Table 7. Number and Percent of Witnesses Choosing Suspects, Fillers, or No One as a Function of Witness Status: Witness to Crime or Victim of Crime

Status     # Suspects   # Fillers   # No Choice   % Suspect   % Filler
NA           6            0           14          30.00       0.00
Victim     242           20          152          58.45       4.83
Witness    186           12          105          61.39       3.96

 

We can also examine whether witness status played a different role in simultaneous than sequential lineups. Because the consequences of making a choice are different for the two types of witnesses, we might expect those who simply witnessed the crime to be less likely to “want” to convict someone, just anyone. This tendency might be something that the investigators can take advantage of when they know who the suspect is in the lineup. If so, we might expect victims to be less likely to pick foils and more likely to pick suspects, but only when presented with the simultaneous procedure in which the investigators knew who the suspect was.

 

The results in Table 8 are inconsistent with this view. As can be seen, victim and witness choice rates were very similar within each lineup procedure. Stated differently, the effect of lineup procedure on choice rates was about the same for witnesses and victims.

 

Table 8. Number and Percent of Witnesses Choosing Suspects, Fillers, or No One as a Function of Lineup Procedure and Witness Status: Witness to Crime or Victim of Crime

Procedure      Status     # Suspects   # Fillers   # No Choice   % Suspect   % Filler
Simultaneous   Victim     153           5          83            63.49       2.07
Simultaneous   Witness    111           3          50            67.69       1.83
Sequential     Victim      88          15          69            51.17       8.72
Sequential     Witness     75           9          55            53.96       6.47

 

  1. Jurisdiction Matters

 

Whether or not critics are correct in their explanation for the differences in results between simultaneous and sequential lineups, we cannot make a general claim about the relative effectiveness of the two methods, because the outcomes differed between jurisdictions.

 

Table 3 (Table 7 from report): Number and Percent of Suspect and Filler Choices for Known Single Suspect Lineups as a Function of Jurisdiction and Lineup Procedure

                                  Number of Choices                  Percent of Total Choices
Jurisdiction   Lineup Procedure   Suspects   Fillers   No Choice     Suspects   Fillers   No Choice
Chicago        Simultaneous       121         0        68            64.0        0.0      36.0
Chicago        Sequential         102        13        91            49.5        6.3      44.2
Evanston       Simultaneous        31         0        12            72.1        0.0      27.9
Evanston       Sequential          23         7        22            44.2       13.5      42.3
Joliet         Simultaneous       111         8        61            61.7        4.4      33.9
Joliet         Sequential          43         4        15            69.4        6.5      24.2

 

If jurisdiction makes a difference, we can’t make a blanket recommendation because we can expect different results in different places. These results suggest that the administrator influence hypothesis would, if true, have to be modified to apply to only some jurisdictions, based on some as yet unknown but measurable difference between the jurisdictions.
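One way to see the jurisdictional pattern is to compute, for each jurisdiction, the simultaneous suspect choice rate minus the sequential suspect choice rate. The sketch below does this from the counts in the jurisdiction table; the structure `counts` and the function names are illustrative only. The gap is positive in Chicago and Evanston and negative in Joliet, which is the reversal that prevents a blanket statement.

```python
# (suspects, fillers, no choice) copied from the jurisdiction table.
counts = {
    "Chicago": {"Simultaneous": (121, 0, 68), "Sequential": (102, 13, 91)},
    "Evanston": {"Simultaneous": (31, 0, 12), "Sequential": (23, 7, 22)},
    "Joliet": {"Simultaneous": (111, 8, 61), "Sequential": (43, 4, 15)},
}


def suspect_rate(row):
    """Percent of lineups in which the suspect was chosen."""
    return 100.0 * row[0] / sum(row)


def suspect_rate_gap(jurisdiction):
    """Simultaneous minus sequential suspect choice rate, in percentage points."""
    rows = counts[jurisdiction]
    return suspect_rate(rows["Simultaneous"]) - suspect_rate(rows["Sequential"])


for name in counts:
    print(name, round(suspect_rate_gap(name), 1))
```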

 

 

Another reasonable explanation for the differences between simultaneous and sequential lineups in the IL experiment

 

If administrator influence is not the correct explanation for the slightly lower filler choice rate and higher suspect choice rate in simultaneous lineups, then how can we explain the results? There is another reasonable alternative. Sequential presentation of items discourages witnesses from picking the best item among the set that is ABOVE A GOOD ENOUGH CRITERION.[4]

 

The last section of these notes presents a thorough theoretical analysis of what we might expect witnesses to do under different decision-making models. The beginning point of these models is the assumption that witnesses have to compare a representation (image, feature list, perceptual reconstruction, etc.) of the culprit that they have in memory with each of the presented alternatives in the lineup. These comparisons are likely to be evaluated in terms of the degree of match between what is in the witness’s memory and the presented item. This degree of match can vary in strength from no match at all to a perfect match. Because the “degree of match” is a subjective psychological dimension, the witness must set a “criterion” that establishes whether the degree of match is high enough to say that the presented alternative is the culprit. If the degree of match is above the criterion, the witness will tend to say that the presented alternative is sufficiently close to her memory of the culprit to say that it is the culprit. If the degree of match is below the criterion, the witness will say that the alternative is not the culprit.

 

Different witnesses will tend to have different criteria. Instructions and consequences will tend to control where witnesses set their criteria. For example, if a witness is told that the culprit might not be in the lineup, they will tend to raise their criteria and be less willing to pick someone from the lineup than if they were told that the culprit is in the lineup.

 

In a simultaneous lineup, witnesses compare each alternative to their memory of the culprit. If all matches are below the criterion, witnesses will not choose an alternative from the lineup. If only one alternative is above the criterion, they will tend to pick that alternative as the culprit. If more than one is above the criterion, the witness will be able to pick the one alternative with the highest degree of match. This reasoning, when applied to sequential lineups, might explain the results from the IL study.

 

In sequential lineups, when a "good enough" foil comes before the suspect, witnesses who would have picked the suspect over the foil (because the suspect is the best choice) might pick the foil because they haven't seen the suspect yet. That is, if the degree of match for a foil is above the criterion, the witness might pick this alternative because it is the first, and only, alternative seen up to that point that is good enough to be the culprit. But if the suspect appears later on, either the witness will not get to see the suspect because the lineup procedure stopped after the first choice, or the witness might become committed to this first choice.[5] As a result, early foil choices might prevent later appearing suspects from being chosen.

 

We attempted to test this idea by breaking the foil choices down into those made before the suspect appeared (for the sequential lineup) and those that were made AFTER the suspect appeared. The results are:

 

We knew the size of the sequential lineup and the position of the suspect in the lineup for 317 sequential lineups. We measured the average percentage of foils that appeared before the suspect (48.3%) and the percentage that appeared after the suspect (51.7%). Thus, about the same number of foils appeared before the suspect as appeared after the suspect. If witnesses picked foils "randomly" from the sequence, we would expect them to pick foils that appeared before the suspect about as often as they picked foils that appeared after the suspect. In fact, of the 24 foil choices that we coded for all sequential lineups, 16 were pre-suspect choices and 8 were post-suspect choices. Thus, there was a significantly greater chance that foils would be chosen before the suspect appeared than after the suspect appeared. It is of some interest that the number of post-suspect foil choices (8) is very close to the number of simultaneous lineup foil choices (6).
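Readers who want to gauge how unusual a 16-to-8 split is under the "random" picking assumption can compute an exact binomial tail probability. The sketch below is only one possible check: it assumes each foil choice independently falls before the suspect with probability 0.483 (the average share of foils appearing before the suspect) and uses a one-sided test. It is not necessarily the test behind the statement above, and the function name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Assumption of this sketch: a randomly chosen foil is a pre-suspect foil
# with probability 0.483, independently across the 24 coded foil choices.
p_before = 0.483
tail = binomial_upper_tail(16, 24, p_before)
print(f"One-sided probability of 16 or more pre-suspect choices: {tail:.3f}")
```

A two-sided test, or a test that uses each lineup's actual suspect position rather than the average, could give a different value.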

 

Related issues: Uncertainty in Protocol for Sequential Lineups

 

A number of different protocols are possible for how to run sequential lineups. The uncertainty in the range of possible protocols means that advocates should specify which protocol variation they are suggesting would be better than current procedures. If they argue that it does not matter which specific protocol is used, then they should be required to show evidence that there are no differences in the eyewitness performance across the different variations.

 

  1. What are witnesses led to believe about the size of the lineup prior to being shown the first person?
    1. Witness might be led to expect to see a larger number of items than will actually be shown (to prevent witness from knowing that the last item in the list is their last chance to pick someone)

                                                    i.     Here no item is the last one

    2. Witness might not know how many items will be in the lineup

                                                    i.     Here each item might be the last

    3. Witness might be told how many items are in the lineup

                                                    i.     Here witness knows that the last item is the last opportunity to pick

  2. What are witnesses told about what will happen after they choose an item?
    1. Stop after the first choice

                                                    i.     Here witness believes that if they pick an early item in the list, they will not have the opportunity to see the remaining items

    2. Continue after first choice

                                                    i.     Here witness believes that if they pick an early item, they will still be able to see the remaining items, but is there a cost to changing one’s choice?

    3. Don’t choose until all are presented

                                                    i.     Here each item is shown one at a time but witness has to withhold decision until all items are seen

1.     This might involve presenting an item, removing it after witness is finished looking, presenting the next item, and so on. After all items are viewed one at a time in sequence, the witness could be shown all items together for a final simultaneous viewing.

2.     Other alternatives could be used:

a.      Witness recalls one or more numbers

b.     Witness asks to see specific numbers (say, 1 and 4) again.

c.      Etc.

  3. What are witnesses told about whether they will be able to go through the sequence more than one time?
    1. Single pass

                                                    i.     Here witnesses know that if they don’t pick someone the first time through, they won’t have another chance to look over any of the pictures again

    2. Multiple pass

                                                    i.     Here witness might withhold his choice the first time through in order to see all of the items before making a decision.

  4. What are witnesses told about what will happen to each item as they go through the lineup?
    1. Each item will be removed after they decide “yes or no” and then a new one will be presented.

                                                    i.     Here witness can only mentally review a previous item when presented with the next one in the list

    2. Each item will remain in view after they decide “yes or no”.

                                                    i.     Here witness can look back at items they have already rejected when looking at each new item in the list.

  5. How will suspects be assigned to position, at random or based on specific ideas about how position might affect choice rates?
    1. Will some positions be avoided?

                                                    i.     Putting suspect in position one makes the lineup very much like a single person “showup” since there is little or no opportunity for a witness to pick a foil and thereby indicate that they have a very low decision criterion.

                                                  ii.     Putting suspect in last position might increase odds that witnesses will pick because this is the last opportunity for a witness to select someone.

    2. Witnesses might change their decision criterion as they go through the sequence either because of the similarity among the alternatives that they see or because of the number to which they have already said, “no.”

 

Some critics suggest that these uncertainties have been easily handled in the past by various jurisdictions. What this means is that decisions have been made by different jurisdictions about precise aspects of the protocol, thereby eliminating any uncertainty (they had) in how to run the procedure. BUT, this response misses the point entirely. It simply means that we can choose one or more of the options above and say, this is what we will do. The problem is that there is virtually no laboratory or field research that tells us what effect, if any, these different protocols will have on several critical measures of performance. In particular, despite the claim that the uncertainty has been resolved in the past, we have no idea how foil choice rates, suspect choice rates, and correct (guilty suspect vs. innocent suspect) suspect choice rates are affected by these variations. Equally, if not more importantly, we do not know how these procedures will affect position effects on all of these measures AND on the decision processes that witnesses actually use. (We show later how a careful theoretical analysis of some of these differences might lead to very different predictions about witness performance.)

 

When witnesses are presented with items in sequence, rather than all at once, we raise the distinct possibility that witnesses will use different decision criteria for different items as they progress through the sequence of items. For example, when presented with the first item, a witness might believe (depending on the particular protocol) that if she chooses the first item, she will not be able to see the remaining items. This could cause her to say, “no”, to a very good option (maybe even the culprit) because she thinks a better option might be coming later. On the other hand, when the witness believes that she is getting near the end of the list, she could either decide that the culprit is probably not in the lineup and so not pay attention to the last item, or she might decide that she had better select this item because it is her last chance to pick someone.

 

If witnesses change their standards as they progress through the sequence, this would be the equivalent of giving different instructions to the witness for each item in the list, e.g.,

            First item: Pick only if you are absolutely certain that this item is the culprit.

            Last item: Pick if this item looks something like the culprit because it is your last chance.

If we do learn how position affects the decision processes that witnesses use and we decide that there are ideal placements of suspects (e.g., positions 3 and 4), what will happen if this information becomes public? What is the process by which position will be selected? Should suspects be allowed to decide; should position be randomly determined, or should position vary with the details of the protocol?

 

Another position effect issue is that some protocols can affect choice rates for different positions independent of the decision process that witnesses use. This is because with some protocols, picking an item early in the sequence prevents the witness from being able to pick an item later in the sequence.[6] This will affect who will get to see later items in the list and therefore affect how often suspects and foils will be selected as a function of their position in the sequence. These position effects are so clear that if we fail to find position effects in some sequential lineup protocols, it effectively means that witnesses MUST be altering their standards as they proceed through the sequence in order to counterbalance these statistical constraints.
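The statistical constraint described above can be illustrated with a small calculation. Under a stop-after-first-choice protocol, a witness reaches a given position only if every earlier item was rejected. The sketch below assumes, purely for illustration, that every earlier item is a foil whose match value is standard normal and that the criterion is constant across positions; the function name `reach_probability` is hypothetical.

```python
from statistics import NormalDist


def reach_probability(position, criterion):
    """Chance the witness is still viewing items at `position`.

    Assumes (for illustration only) that all earlier items are foils with
    standard normal match values and that the criterion never changes.
    """
    p_reject = NormalDist().cdf(criterion)  # chance a single foil falls below the criterion
    return p_reject ** (position - 1)


for position in range(1, 7):
    print(position, round(reach_probability(position, criterion=1.0), 3))
```

Even with a fixed criterion, the chance of ever viewing a late-position suspect shrinks with each additional foil that precedes the suspect, so suspect choice rates must fall with position unless the criterion itself changes.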

 

Theory and Results: Comparisons of Laboratory and Field Results

 

You might find the attached graphs of interest.

 

I ran some simulations of a simultaneous lineup of size six, using the following decision rule: compare the match for each item in the lineup to the others and pick the best one; if that item is high enough (above the decision standard), select it; otherwise say no. I varied the criterion (c) and d’ (the mean of the suspect distribution compared to the mean of the average of the foil distributions, that is, memory strength). These simulations show what one would expect in terms of the rate of suspect (whether guilty or innocent) and foil choices as a function of c and d’.

 

 

 

Two lineup types are considered: target present (TP), and target absent (TA), in which the culprit is removed and replaced with an innocent suspect. The foils stay the same.

 

Here d’ varies along each line of connected dots, with lower d’ values producing lower hit rates in target-present lineups. The simulation assumes normal distributions of d’ values. The decision model is: Examine all items, and pick the item that has the highest value only if it is above c. If more than one is above c, pick the highest. If none are above c, say no.
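For readers who want to experiment with this decision model, the following is a minimal Monte Carlo sketch of it. It is not the code used to produce the graphs. It assumes unit-variance normal match values, foils centered at 0, and a suspect centered at d’; the function name `simulate_simultaneous`, the number of trials, and the seed are all illustrative choices. A TA lineup can be approximated by passing a lower d’ for the innocent suspect.

```python
import random


def simulate_simultaneous(d_prime, c, lineup_size=6, trials=20000, seed=0):
    """Return (suspect rate, foil rate, no-choice rate) for the best-above-c rule."""
    rng = random.Random(seed)
    suspect = foil = none = 0
    for _ in range(trials):
        # Assumed match distributions: suspect ~ N(d', 1), each foil ~ N(0, 1).
        suspect_match = rng.gauss(d_prime, 1.0)
        best_foil = max(rng.gauss(0.0, 1.0) for _ in range(lineup_size - 1))
        if max(suspect_match, best_foil) <= c:
            none += 1      # nothing is good enough, so the witness says no
        elif suspect_match >= best_foil:
            suspect += 1   # the suspect is the best item and is above c
        else:
            foil += 1      # a foil is the best item and is above c
    return suspect / trials, foil / trials, none / trials


for d_prime in (0.0, 0.5, 1.0, 2.0):
    rates = simulate_simultaneous(d_prime, c=1.0)
    print(d_prime, [round(rate, 3) for rate in rates])
```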

 

Another way of plotting the same outcomes, namely, holding d’ constant and seeing effect of changing the criterion.

 

 

 

In the above figure we can see how foil and suspect choice rates would change as the criterion increases from a low to a high value. Notice that criterion shifts generally have a bigger effect on foil choices than suspect choices until d’ gets fairly large. Then, changes in c have a bigger effect on suspect choices than foil choices.

 

We next present the results from laboratory simulation studies that compare Target Present (higher d’) with Target Absent (lower d’) simultaneous lineups.

 

 


 

 

 

Data are from 25 different simultaneous lineup laboratory experiments. Each connected pair of data points consists of results from a target present and a target absent lineup (presumably d’ is different for each but c remains constant across target present and target absent lineups). The gray line is what the strong form of the relative decision model would predict (e.g., always pick the best one).

 

Sequential Lineup: What is it?

 

I have routinely complained that one of the problems with sequential lineups is that what is meant, specifically, by a sequential lineup has not been specified. As a result, when the sequential procedure is “recommended”, the method by which the procedure should be run is often vague.

 

With a Stop Rule. As noted earlier, one protocol for a sequential procedure presents items one at a time and stops when, and if, a witness chooses one of the alternatives.

 

One of the key problems with this type of sequential protocol is that it prevents the witness from being able to choose the best example when there is more than one example that is a good match. Thus, in high similarity lineups (lineups in which the foils look a lot like the suspect), we might expect that one of the foils presented before the suspect might be a “good enough” match for the witness to pick him. However, were this foil and the suspect presented side by side, the witness might choose the suspect because the suspect is an even better match to the witness’s memory than the foil. One consequence of this is that more foils will be chosen when the suspect is placed later in the lineup. All foils that are “good” enough will be chosen before the witness even gets to see the suspect.

 

We can see the theoretical effects of position on choice rates by conducting a Monte Carlo simulation similar to that done for simultaneous lineups.

The decision rule in this case is:

 

Compare the first presented item to memory, if the result of this comparison is a “good enough” match, that is above c, select this item. If not, then go to next item. Stop if an item is good enough to be picked. Say, “not present”, only if not one of the items in the sequence is above c.
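The stop rule just stated can be sketched as a short Monte Carlo simulation. As with the simultaneous sketch, this is an illustration under stated assumptions rather than the code behind the figures: match values are unit-variance normal, foils are centered at 0, the suspect at d’, and c is held constant across positions. The function name `simulate_sequential_stop` and its parameters are hypothetical.

```python
import random


def simulate_sequential_stop(d_prime, c, suspect_position,
                             lineup_size=6, trials=20000, seed=0):
    """Return (suspect rate, foil rate, not-present rate) under the stop rule."""
    rng = random.Random(seed)
    tallies = {"suspect": 0, "foil": 0, "none": 0}
    for _ in range(trials):
        outcome = "none"  # "not present" unless some item exceeds c
        for position in range(1, lineup_size + 1):
            is_suspect = position == suspect_position
            # Assumed match distributions: suspect ~ N(d', 1), foil ~ N(0, 1).
            match = rng.gauss(d_prime if is_suspect else 0.0, 1.0)
            if match > c:
                outcome = "suspect" if is_suspect else "foil"
                break  # the procedure stops at the first "good enough" item
        tallies[outcome] += 1
    return tuple(tallies[key] / trials for key in ("suspect", "foil", "none"))


for position in (1, 3, 6):
    rates = simulate_sequential_stop(d_prime=1.0, c=1.0, suspect_position=position)
    print(position, [round(rate, 3) for rate in rates])
```

With a moderate criterion, moving the suspect to a later position raises the foil choice rate and lowers the suspect choice rate; with a high criterion, the position effect becomes small.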

 

The next three figures show the predicted effects of d’ and c on suspect and foil choice rates, taking into account position in the lineup (but making the almost surely incorrect assumption of a constant c across position).

 

 

 

 

 

 

 

Note that when the criterion is placed relatively high, simultaneous and sequential outcomes will be very similar at the same c and d’ values. Also note that under these conditions, the position of the target in the sequential lineup will have only a small effect.

A shift from a TP to a TA lineup will affect hit rates but not false alarms when the decision criterion is high. This is a critical result.

 

 

 

 

The above analyses make an interesting prediction. Namely, they suggest that sequential lineups should actually produce more filler choices than simultaneous lineups when the suspect is placed in later positions. This is because in a sequential lineup witnesses are prevented from comparing the subset of alternatives that are above the criterion and then picking the one that is most above the criterion. Instead, the witness will pick the first item that is above the criterion, missing later alternatives, including the suspect, who might well be more above the criterion than the first “good enough” filler. This is a major problem with sequential lineups! This reasoning and analysis suggest that being able to select the one alternative that is the most above the criterion is an advantage that will lead to better, not worse, eyewitness performance (all other things equal).

 

Another protocol for sequential lineups allows witnesses to view all of the items in the sequence rather than stop after the first choice. This protocol typically also allows the witness to choose more than one item. That is, if the witness chooses item 2, continues on, and then chooses item 5, the witness has chosen two items. We can code witness performance in such a task by counting all choices. If we do, we must decide what to do when a witness picks more than one item. If all of the picks by a witness are foils, we can say that the witness picked foils only. If the witness only picks the suspect, then we can call this a suspect selection. What do we do when a witness picks both the suspect and one or more foils? We could ask the witness to pick the one that she believes is the culprit, we could simply count the witness as a bad witness and assume that one foil is equivalent to all foils, or we could say that any suspect choice is a selection of the suspect, regardless of other choices. Other rules might discount a suspect choice in proportion to the number of additional foils that the witness chose.

 

We can model these decision rules in the same way that we modeled the simple sequential and simple simultaneous lineups. The next figure shows what we would find if we coded any hit as a suspect choice regardless of the number of foil choices that accompanied the suspect choice. The different lines show the predicted choice rates at a given cutpoint as d’ changes.

 

 

In the next figure, the connected points show the choice rates at a given d’ as the criterion changes. Remember that criteria might change with different instructions, different payoff, different procedures, and/or different similarity structures in the lineup.

 

 

 

The next one assumes that we count suspect-only choices and foil-only choices. That is, we count any pattern that includes both foil and suspect as a no choice. This is a different analysis of the same data produced by the sequential lineup Monte Carlo simulation that assumes the witness will pick any item that is above c as she goes through the entire sequence. It assumes c remains constant for all positions. Notice that this is equivalent to a sequential lineup with a stop rule in which the suspect is placed in position 1.
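The two coding rules discussed above (count any suspect pick as a suspect choice, versus count only suspect-only and foil-only patterns) can be applied to the same simulated witnesses. The sketch below is a hedged illustration, not the original simulation: it assumes unit-variance normal match values, foils centered at 0, a suspect centered at d’, and a constant c. The names `simulate_full_sequence`, `code_outcomes`, `any_hit`, and `strict` are invented for this example.

```python
import random


def simulate_full_sequence(d_prime, c, lineup_size=6, trials=20000, seed=0):
    """Tally pick patterns when the witness views every item and picks all above c."""
    rng = random.Random(seed)
    tallies = {"suspect_only": 0, "foils_only": 0, "suspect_and_foils": 0, "no_pick": 0}
    for _ in range(trials):
        picked_suspect = rng.gauss(d_prime, 1.0) > c
        picked_foil = any(rng.gauss(0.0, 1.0) > c for _ in range(lineup_size - 1))
        if picked_suspect and picked_foil:
            tallies["suspect_and_foils"] += 1
        elif picked_suspect:
            tallies["suspect_only"] += 1
        elif picked_foil:
            tallies["foils_only"] += 1
        else:
            tallies["no_pick"] += 1
    return tallies


def code_outcomes(tallies, rule):
    """Return (suspect rate, foil rate, no-choice rate) under a coding rule."""
    total = sum(tallies.values())
    if rule == "any_hit":    # any suspect pick counts, whatever else was picked
        suspect = tallies["suspect_only"] + tallies["suspect_and_foils"]
        none = tallies["no_pick"]
    elif rule == "strict":   # mixed suspect-plus-foil patterns count as no choice
        suspect = tallies["suspect_only"]
        none = tallies["no_pick"] + tallies["suspect_and_foils"]
    else:
        raise ValueError(f"unknown coding rule: {rule}")
    return suspect / total, tallies["foils_only"] / total, none / total


tallies = simulate_full_sequence(d_prime=1.0, c=1.0)
for rule in ("any_hit", "strict"):
    print(rule, [round(rate, 3) for rate in code_outcomes(tallies, rule)])
```

Because both codings are computed from the same tallies, any difference between them reflects the coding rule alone and not the witnesses' decisions.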

 

The next graph shows the same results but viewed when holding d’ constant to see effect of changing the criterion on choice rates:

 

 

 

Comparing this to the stop after first choice sequential lineup shows how changes in criterion will affect suspect and foil choices differently for given d’s (0, .5, 1, and 2). Notice how the pattern is very different from what would occur with the stop after first choice protocol.

 

 

Average of laboratory results plotted as before for simultaneous and sequential lineups. Notice that the top data points are for TP lineups and the bottom are for TA lineups. (d’ can vary either because memory strength varies or because the picture that one is being tested on looks more or less like the representation people have in memory. This is because the decision process has to be based on some type of comparison between what is in memory and the picture, or person, in front of the witness.)

 

 

 

 

 

This shows the same lab average results AND the results from whatever field data we have. There are several key points:

 

  1. Results from the field are more similar to TP results from the simultaneous procedure in the lab than they are to either simultaneous TA or sequential TA lineups in the lab. This suggests that field data contain a much higher proportion of TP lineups than the lab (in the lab we know it is 50/50, represented by the outlined data points in the middle of the lines from the lab).
  2. Note how foil choice rates tend to be lower in the field than they are in TA lineups in the lab.
  3. We are really not sure why the results are different in field and the lab, but they are different. Clearly more work is needed before we proceed.
  4. We do not know how to explain jurisdictional variability, not only in absolute rates but also in differences between sequential and simultaneous.

 

 

Some Recommendations

  1. We might want to have a system in which more expensive and time consuming (but potentially more reliable) methods (e.g., videotaping) are used for those cases in which an error is worse (e.g., long prison sentence, death penalty) but in cases in which probation or short incarceration periods are anticipated, we could use less expensive and time consuming methods.
  2. More research is clearly necessary.
    1. Adopt an experimenting society approach to policy change.
    2. Always include in all recommendations for change methods and measures that will allow you to monitor whether procedure is doing a better job.

                                                    i.     Means you will have to agree on measures of success BEFORE protocols are changed.

                                                  ii.     Means protocols will have to include control groups (and/or phased random implementation with constant monitoring)

    3. Need much more in the way of THEORY development so we can tell what theories actually predict about outcomes.

                                                    i.     Relative Decision Model is clearly wrong in its strong form.

                                                  ii.     Weak Form of Relative Decision Model is so vague that it cannot be falsified, and it is therefore useless as a theory from which to derive predictions about eyewitness performance.

                                                iii.     Role that similarity structure plays in the lineup may be critical (see Clark, 2005 and Flowe, 2006) and we don’t understand quite how it works. People might be changing their criteria as a result of similarity relationships and/or they might have multiple criteria (one for memory comparison and one for whether there might be someone else in the lineup who looks good enough). More theoretical work clearly needs to be done on this.

  3. It is critical for those who are interested in applying psychology to the real world to understand the nature of the research methods used and the purpose of those methods.
    1. Psych research is NOT designed to develop theories that can make “point” predictions.

                                                    i.     Theories in most physical sciences are tested by making point predictions (the speed of light is predicted to be…, under such and such conditions, the time it takes for an object to reach the ground is…)

                                                  ii.     Psychological theories do not make point predictions. They make more or less comparative predictions.

1.     Accuracy will be less with shorter than longer exposure times.

2.     Or accuracy will be better with longer than with shorter exposure times.

3.     Notice that this type of theory cannot tell us what the accuracy rate will be at a specific exposure duration, say 30 seconds.

4.     Note also that this type of theory (and research method) cannot tell us whether a witness who saw someone for 30 seconds is likely to be correct.

5.     What we can say is that witnesses who saw someone for 30 seconds are probably more likely to be correct than witnesses who saw someone for 1 second. Alternatively, 30 second witnesses are less likely to be correct than witnesses who saw the culprit for 1 hour.

6.     Note further that we cannot tell from this theory and method whether the 30 second accuracy rate will be more like the 1 hour rate than the 1 second rate.

  4. Note that psychological research is very general (we average over large numbers of people and conditions). But often we forget that averages may be more or less representative of the actual underlying specific examples that have been averaged. For example, if we average these numbers 2, 2, 10, 10, we get 6. But not one example is a 6! This makes generalization difficult unless we have some idea how representative the general is of the specific examples.


[1] By the way, this result had been found in other studies of actual lineups (see Table 12 in Valentine, Pickering, & Darling (2003). Characteristics of eyewitness identification that predict the outcome of real lineups. Applied Cognitive Psychology, 17, 969-993).

[2] At least one other field study of actual live lineups conducted in California found that witness confidence was highly predictive of witness choices. In particular, 95% of witnesses who expressed high confidence in their choices selected the suspect and 5% selected a filler. In contrast, 65% of witnesses who expressed moderate confidence in their choices selected suspects rather than fillers (Behrman & Davey (2001). Eyewitness identification in actual criminal cases: An archival analysis. Law and Human Behavior, 25, 475-491).

[3] For some reason, the files that we coded were missing confidence information in a fairly large proportion of the cases. As a result, these data should be viewed with some caution.

[4] We have also argued in Ebbesen & Flowe (2003) that the sequential lineup procedure tends to raise the “good enough” standards that witnesses set. Thus, in general, witnesses viewing a sequential lineup will require a higher degree of match value than witnesses viewing simultaneous lineups. This would tend to explain why fewer suspects are picked with sequential lineups.

[5] This is equivalent to assuming that witnesses raise their criterion after they make their first choice in a continuous sequential lineup.

[6] Even if we use protocols that allow witnesses to continue picking after an early choice, we do not know, at this time, whether an early pick might not change the standards that witnesses use on later picks, e.g., “I already picked someone, so I better be really sure before I change and say it is someone else.”