Taking a Closer Look at Forensic Science Behind U.S. Criminal Justice

This story was originally published in our November/December 2021 issue as "Fixing Forensics."

Statistics research doesn't usually require weapons. But to develop their latest algorithm, Iowa State University statisticians Alicia Carriquiry and Heike Hofmann needed thousands of bullets fired from a small collection of handguns. So they put the firepower in their own hands and hit the range.

For nearly a year, Carriquiry and Hofmann, supervised by sheriff’s deputies, unloaded round after round into a tube with Kevlar fibers. After each shot, they fished out the bullet and tucked it in a plastic baggie labeled with critical data: gun, barrel, shot number.

“If you had asked me a few years ago whether I was going to be doing this type of data collection, I would have said, ‘You’re crazy,’ ” says Carriquiry.

It’s not just bullets, either. Over the past several years, the professors and their collaborators have amassed 160 pairs of well-trod sneakers, over 2,000 handwriting samples and 129 poster boards spattered with pig blood. Scanned and digitized, the items became first-of-their-kind data for research on crime scene evidence.

On detective shows and in reality, the task of analyzing clues like fingerprints and ransom notes usually falls to forensic examiners, not scholars of statistics. But Carriquiry and Hofmann belong to a movement of academic outsiders investigating the foundations and legitimacy of forensic science. An open secret drives their work: Fingerprints, bloodstains and other forms of forensic evidence entered the justice system without scientific vetting, and have largely evaded scrutiny ever since. About a century ago, crime labs run by law enforcement — not scientists — began developing methods to connect clues to culprits, like fingerprint identifications. But peer-reviewed studies never took place to establish the methods’ validity, reproducibility and error rates — key criteria that distinguish science from speculation.

“It’s mind-boggling how they got away with it for decades,” says University College London cognitive neuroscientist Itiel Dror, another academic who has uncovered faults in forensics.

The flimsy science came to light in 2009, with a blockbuster 350-page report authored by a National Research Council (NRC) committee of scientists, judges, lawyers and forensic practitioners who had spent two years reviewing the field. Unanimously, they concluded only straightforward DNA identifications met scientific standards. The rest: “Essentially hocus pocus,” says Carriquiry.

Keith Harward, waving here as a free man in 2016, spent 33 years in prison after flawed bite mark evidence linked him to a murder. (Credit: Richmond Times-Dispatch/Alamy)

That so-called sorcery has destroyed lives. The Innocence Project, a non-profit legal organization, found dubious forensics contributed to about half the wrongful convictions the group has overturned with DNA testing in the U.S. since 1992, including 14 death row sentences. The National Registry of Exonerations, a public database maintained by three universities, lists some 670 cleared cases, between 1989 and 2021, that originally involved false or misleading forensic evidence, ranging from bunk bite marks to smudgy fingerprints. Collectively, the innocent have languished in prison for thousands of years, while true perpetrators roam free.

Solutions proposed by academics could help fix forensics. Yet, many crime labs have resisted changes, blaming mistakes on bad apples, while courts continue to say precedent protects forensic evidence. Ultimately, scientists are learning they can run the first leg of this race — establishing validity and reliability. Then, they must pass the baton to judges and lawyers, and hope they will carry reforms to the finish line.

Seeking Evidence-Based Solutions

Keith Harward was serving in the U.S. Navy in 1982, when he was convicted of a rape and murder in Newport News, Virginia. Harward spent 33 years in prison after six forensic examiners concluded his teeth matched bite marks on the victim. The Innocence Project performed DNA testing that cleared Harward in 2016, and linked the deed to another sailor, who had committed more crimes in the meantime. Over the years, bite mark analysis has contributed to at least 25 documented wrongful convictions or indictments, while scholars have widely debunked its validity. Multiple experiments have shown that examiners cannot reliably distinguish bites from other bruises, let alone tie the marks to a specific person’s teeth.

Critical research like this arose from the 2009 NRC report, which sparked some outcry and reform initiatives, then hit a wall. “They failed to penetrate the courts,” says Simon Cole, professor of criminology, law and society at the University of California, Irvine. Several years later, a 2016 report from the President’s Council of Advisors on Science and Technology (PCAST) documented little overall progress on scientific standards for forensic evidence. In 2017, a reform-minded advisory panel, created in the Obama era, was terminated by Trump’s Department of Justice — a move applauded by the National District Attorneys Association.

Software, as well as experts in firearms identification, can now analyze images side by side to compare microscopic marks on bullets or casings. (Credit: Courtesy of CSAFE)

Carriquiry is left roiling at the mention of discredited methods like bite-based identifications. “Oh my God, talk about unscientific and unproven,” she says. In 2016, the Texas Forensic Science Commission recommended a moratorium on the use of bite mark evidence in criminal justice. Yet courts across the U.S. continue to admit bite mark evidence, because of precedent.

For those pushing reforms to forensics within the legal system, “It’s a lot of beating one’s head against the wall,” says David Kaye, a law professor emeritus at Penn State and Arizona State universities, who has served on numerous federal committees concerned with standards in forensic science.

Meanwhile, academics forge ahead, running experiments and analyzing data, to improve forensics. “We just want to stay above the fray and do the science behind it all,” explains Carriquiry. She leads a consortium at the forefront of this effort, the Center for Statistics and Applications in Forensic Evidence (CSAFE), with more than 80 researchers across several universities. The group’s members include physicists, engineers and computer scientists, as well as professors of law, criminology and forensic science itself.

When the group formed in 2015, CSAFE faced cold shoulders from forensic practitioners, the professionals examining actual evidence in real cases. They feared that academics, with no practical experience, had come to trounce their methods and livelihoods. Waving white flags, CSAFE leadership promised partnerships, which have been borne out over the past six years. The professionals offer expertise and guidance. The professors undertake laborious studies, which wouldn’t be feasible for examiners with heavy caseloads and limited resources.

“The research that academia does is amazing,” says Texas-based forensic examiner Stephanie Luehr. “As long as they’re consulting a practitioner.” Luehr specializes in firearms identification, which traces spent ammunition back to the shooter’s weapon. CSAFE research could improve how Luehr and other professionals carry out their work — and assuage dire concerns raised in the 2009 NRC and 2016 PCAST reports.

One such worry, permeating both reports, is that popular forensic disciplines are unabashedly subjective. Trained examiners judge whether two pieces of evidence look similar enough to call them a match: say, a fingerprint from a crime scene and one from a suspect. But such a nebulous threshold, “similar enough,” means two experts can reach different conclusions, given the same evidence.

This subjectivity does not sit well with scientists. “You need the examiner’s eyes and their brain to come up with a conclusion,” says Hofmann, the Iowa State statistician who fired bullets with Carriquiry.

The NRC and PCAST reports also warned that most forensic methods have never been subjected to proof-of-concept studies, to establish validity and error rates. In science, these are must-haves. Without these checks, no one knows how often purported matches are wrong and true matches are missed.
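
To make the idea of an error rate concrete, here is a minimal Python sketch of how the two rates could be tallied from a validation study in which the true source of every pair is known. The function name `error_rates` and the trial counts are hypothetical, chosen only for illustration; they do not come from any study mentioned in this story.

```python
def error_rates(trials):
    """Tally error rates from ground-truth trials.

    trials: list of (truly_same_source, examiner_called_match) booleans.
    Returns (false_positive_rate, false_negative_rate).
    """
    same_source_calls = [called for truth, called in trials if truth]
    diff_source_calls = [called for truth, called in trials if not truth]
    if not same_source_calls or not diff_source_calls:
        raise ValueError("need both same-source and different-source trials")

    # False negative: a true match that the examiner missed.
    false_negatives = sum(1 for called in same_source_calls if not called)
    # False positive: a purported match between items from different sources.
    false_positives = sum(1 for called in diff_source_calls if called)

    return (false_positives / len(diff_source_calls),
            false_negatives / len(same_source_calls))


# Hypothetical study: 20 same-source pairs and 30 different-source pairs.
trials = (
    [(True, True)] * 18      # true matches correctly called
    + [(True, False)] * 2    # true matches missed
    + [(False, False)] * 29  # non-matches correctly excluded
    + [(False, True)] * 1    # non-match wrongly called a match
)

fpr, fnr = error_rates(trials)
print(f"false positive rate: {fpr:.3f}")
print(f"false negative rate: {fnr:.3f}")
```

The point of the sketch is that neither number can be computed without trials whose right answers are known in advance, which is the kind of study the reports found missing.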

Statistics for Ballistics

CSAFE scientists aim to remedy these shortcomings, and to dial back forensics’ subjectivity with cutting-edge objective methods. In particular, they’re keen on developing software programs that automatically assign pairs of clues a similarity score: a value between 0 and 1, where 1 indicates a near-certain match, 0.6 or 0.5 would be more uncertain, and 0 means the items bear little resemblance. Two experts could no longer reach different conclusions for the same bullets, prints or handwriting samples; they would run that evidence through CSAFE’s software and receive an identical similarity score.

On this front, CSAFE has made the most progress with bullet matching. When a gun fires, micro-imperfections in the barrel carve distinctive scars into bullets. Examiners like Luehr use microscopes to visually compare the engraved features on two bullets: one from a crime scene, another test-fired from a suspect’s gun. Then they decide whether the samples come from the same gun, come from different guns or whether the answer is unclear.

That decision can now rest on an objective score, thanks to CSAFE’s new software. To develop the program, Hofmann and colleagues 3D-scanned bullets, collected from the CSAFE shooting excursions and partnering police departments. Then they fed pairs of scans into a machine-learning algorithm, along with a key variable: whether the bullets were mates fired from the same gun, or non-mates from different guns. Training with this dataset, the computer taught itself to read bullets’ microscopic grooves and scratches in order to compute the likelihood that two bullets are mates.

Now, when the resulting software receives scans of mystery bullets, which may or may not come from a single gun, it can assign them a similarity score between 0 and 1. This new software compares bullet micro-scars, just as examiners do. But, unlike human minds, the program churns out the same score for a given bullet pair, no matter who runs the analysis.
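
That last property, the same score for the same pair no matter who runs it, follows from replacing a judgment call with a fixed calculation. The Python sketch below is a toy stand-in, not CSAFE's software: it treats each bullet scan as a one-dimensional profile of groove depths, finds the best alignment by correlation, and squashes the result to a score between 0 and 1. The profiles, the function names and the `steepness` and `midpoint` constants are all assumptions for illustration; a trained model would learn its own mapping from labeled mates and non-mates.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    r = cov / math.sqrt(var_x * var_y)
    return max(-1.0, min(1.0, r))  # guard against floating-point overshoot


def best_alignment_correlation(a, b, max_lag=3):
    """Slide profile b against profile a and keep the highest correlation."""
    best = None
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(a[i], b[i + lag]) for i in range(len(a))
                 if 0 <= i + lag < len(b)]
        if len(pairs) < 3:
            continue
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None and (best is None or r > best):
            best = r
    if best is None:
        raise ValueError("profiles could not be compared")
    return best


def similarity_score(a, b, steepness=10.0, midpoint=0.5):
    """Map the best correlation to (0, 1) with a logistic curve.

    steepness and midpoint are hypothetical constants, not learned values.
    """
    r = best_alignment_correlation(a, b)
    return 1.0 / (1.0 + math.exp(-steepness * (r - midpoint)))


# Invented groove-depth profiles (arbitrary units).
crime_scene = [0.0, 1.2, 0.3, 2.1, 0.8, 1.9, 0.1, 1.4, 0.6, 2.4, 0.2, 1.1]
test_fire = crime_scene[2:] + [0.9, 1.7]  # same marks, shifted two positions
other_gun = [2.0, 0.1, 1.8, 0.2, 1.9, 0.3, 0.4, 0.5, 2.2, 2.1, 0.2, 0.1]

mate_score = similarity_score(crime_scene, test_fire)
other_score = similarity_score(crime_scene, other_gun)
print(f"possible mates:     {mate_score:.3f}")
print(f"possible non-mates: {other_score:.3f}")
```

Because nothing in the calculation depends on who runs it, two analysts feeding in the same pair of scans would get the same number.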

Adding Probability to Penmanship

Like firearms identification, handwriting analysis usually entails subjective, visual assessments. An examiner compares samples of penmanship to decide if a single person wrote the texts. That conclusion may figure prominently in cases involving ransom letters, suicide notes and forged documents.

But penmanship poses additional challenges. Whereas gun barrels typically mark bullets with consistent patterns, an individual’s handwriting varies day to day, morning to night and by context. Consider how your writing differs between a grocery list and a letter to Grandma. For similarity score calculations to work, those day-to-day, note-to-note fluctuations in one person’s marks must be subtler than differences between the writing of any two individuals.

CSAFE scientists have been building the datasets necessary to test this premise. During her doctoral research at Iowa State, statistician Amy Crawford oversaw the collection of more than 2,400 handwriting samples. Her 90 participants scribbled three prompts on three occasions, spaced weeks apart. One prompt, The London Letter, contained every letter, digits 1 through 9 and common punctuation. Packing all that in made for awkward prose, such as “Dr. L. McQuiad and Robert Unger, Esq., left on the Y.X. Express tonight.” To capture natural, flowing script, Crawford also chose an excerpt from The Wizard of Oz. And the third prompt was a short phrase, the kind a burglar might scribble down in a hurry.

It’s not clear yet if similarity scores will be feasible for handwriting analysis, but Crawford and colleagues have already made the next best thing. They created software, reported in February in Statistical Analysis and Data Mining, that estimates authorship probabilities among a specific set of possible writers. Out of, say, 100 candidates, it identifies the person most likely to have written a document.

The software breaks words into pieces, delimited by end points, sharp turns and intersecting lines. These pieces often correspond to letters of the alphabet, but not always. Similar pieces are grouped into about 40 clusters, and then the algorithm counts how often each writer uses a cluster and how much that writer varies within it. For example, a cluster could comprise all single lines that deviate a few degrees from vertical and undulate a bit from straight. Using this data, the software calculates the most likely author from a group of suspects. This could help solve real cases, including one of the most famous U.S. kidnappings.

Back in 1956, Betty Weinberger tucked her 1-month-old, Peter, into his carriage on the patio of her suburban Long Island home. She went inside for a moment and returned to an empty carriage and a ransom note — a dozen lines of hurried text, with apologies, demanding $2,000 for baby Peter’s life.

With only that evidence, and a second note left six days later, the FBI launched a herculean search. They reviewed nearly 2 million bureaucratic documents from schools, factories and government agencies in search of the kidnapper’s handwriting. After six weeks, agents found similar-looking scrawl on a parole form of a man previously nabbed for bootlegging. Confronted with the ransom notes, he confessed, but had already abandoned Peter in brush. Detectives later found the infant’s diaper pin and decomposed remains.

Crawford contends the CSAFE algorithm would have sped up the search, perhaps enough to save the child. Investigators could have fed a computer hundreds of documents at a time, and the program would have identified the most likely writer within each set. That would have narrowed down the suspects in days, rather than weeks.
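
The closed-set idea, picking the most probable writer from a fixed pool of candidates, can be sketched in a few lines of Python. This is a simplified stand-in rather than the published CSAFE model: it assumes each writer's cluster usage follows a multinomial distribution, smooths the counts, gives every candidate equal prior weight and uses four clusters instead of about 40. All names and counts below are invented for illustration.

```python
import math


def authorship_probabilities(writer_counts, questioned_counts, smoothing=1.0):
    """Probability that each candidate wrote the questioned document.

    writer_counts: dict mapping writer name -> cluster counts from known samples.
    questioned_counts: cluster counts from the questioned document.
    Assumes a closed set of candidates with equal prior probability.
    """
    log_likelihoods = {}
    for writer, counts in writer_counts.items():
        if len(counts) != len(questioned_counts):
            raise ValueError("cluster counts must have the same length")
        total = sum(counts) + smoothing * len(counts)
        # Multinomial log-likelihood; the coefficient is the same for every
        # writer, so it cancels when the probabilities are normalized.
        log_likelihoods[writer] = sum(
            q * math.log((c + smoothing) / total)
            for c, q in zip(counts, questioned_counts)
        )

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_likelihoods.values())
    weights = {w: math.exp(ll - top) for w, ll in log_likelihoods.items()}
    norm = sum(weights.values())
    return {w: weight / norm for w, weight in weights.items()}


# Hypothetical cluster usage from each candidate's known writing.
known = {
    "writer_a": [40, 10, 5, 5],
    "writer_b": [10, 40, 5, 5],
    "writer_c": [5, 5, 40, 10],
}
ransom_note = [8, 2, 1, 1]  # hypothetical counts from a questioned note

probs = authorship_probabilities(known, ransom_note)
most_likely = max(probs, key=probs.get)
print(most_likely, {w: round(p, 4) for w, p in probs.items()})
```

The probabilities always sum to 1 across the pool, which is why the approach only answers "which of these candidates?" and says nothing if the true writer is not among them.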

Challenging Fingerprint Infallibility

Academics have also leveled criticism at fingerprints, the public’s favorite form of forensic evidence. Fingerprints reflect minute dermal ridges, which form between three and six months after conception and persist even after death. Although the formation process isn’t entirely understood, genetics play a role. So do localized stresses, in the womb and from fetal nerves. Consequently, even identical twins, indistinguishable by nuclear DNA, have distinct whorls, loops and arches adorning their finger pads. Ink up those baby fingers, and they’ll leave different marks.

Looking at a cleanly pressed fingerprint, “you see how complex it is,” says Cole, the professor at the University of California, Irvine, and a CSAFE member. Immediately, you might think: “Well, this must be a very powerful identifier,” he says. That notion extends back ages, as Cole discovered when he wrote the book Suspect Identities: A History of Fingerprinting and Criminal Identification, published in 2001.

In ancient China, as early as 220 B.C., people authenticated documents with clay seals, stamped with their name and fingerprint. The modern understanding that prints can solve whodunits emerged in the late 1800s. In the first scientific paper to make this point, in an 1880 issue of Nature, medical missionary Henry Faulds reported his successful applications of the method: “In one case, greasy finger-marks revealed who had been drinking some rectified spirit.” Prints soon debuted in popular fiction, as key clues in stories by Mark Twain and a Sherlock Holmes tale by Sir Arthur Conan Doyle.

In 1910, fingerprint evidence first appeared in courts, and was ruled admissible by U.S. appellate judges in 1911. A 1985 FBI manual described the identification method as “infallible.” And in 2003, the head of the FBI’s fingerprint unit insisted that the error rate was “zero,” in several court testimonies, an L.A. Times story and a 60 Minutes episode.

Seeing the FBI agent’s claim in the press, Cole knew it wasn’t true, based on his book research. In 2005, he published an article in The Journal of Criminal Law & Criminology titled “More Than Zero,” which detailed 22 erroneous fingerprint assignments, known from the public record. In one case, an examiner matched a thumbprint from an unidentified corpse to a woman’s inked prints on file with the California Sheriff’s Office. Law enforcement informed her mother, who grieved and prepared a funeral. Then the estranged daughter was found in northern California. She was not dead, only misidentified.

Cole’s list also included the most high-profile fingerprint error: The FBI arrested Oregon attorney Brandon Mayfield for his alleged involvement in the 2004 Madrid train bombing, which killed 191 people. There was no record of Mayfield traveling abroad for 10 years. But FBI examiners matched a print from an explosives-filled bag in Madrid to Mayfield’s prints in their database. According to a government affidavit, the FBI considered the match to be a “100 percent identification” and “verified.” Two weeks after the arrest, the FBI retracted the match and released Mayfield, who sued the FBI for religious profiling based on his Muslim faith. The Spanish National Police found a man whose prints more closely resembled the crime-scene prints in Spain.

More proof of the method’s fallibility came to light in 2006 from the research of cognitive neuroscientist Dror. He showed expert examiners a pair of fingerprints that they had previously analyzed in real court cases and had judged to be definite matches under oath. In the experiment, the examiners weren’t told they had seen the prints before, and Dror provided additional, fabricated details that gave the suspect an alibi. The conclusions of most examiners changed to “no match” or “inconclusive.”

“Everyone was in shock,” recalls Dror. “I feel like I was a whistleblower.” After publishing the initial experiments, Dror received torrents of hate mail and personal attacks from forensic practitioners. But in the 15 years since, study after study has strengthened his message: Systematic, personal and unconscious biases influence subjective methods, like fingerprint matching. Gradually, forensic practitioners have accepted this and opened up to reforms.

The research has also revealed ways to improve accuracy. Dror says one simple solution is to give forensic practitioners the information relevant to the evidence they’re examining, and nothing more. They shouldn’t hear the backstory of the case; that could cloud their judgments. Luehr, the firearms examiner, agrees this is a reasonable safeguard against unconscious biases: “The less you know, the better,” she says.

Beyond that, the quality of the print itself strongly influences the likelihood of a correct interpretation. And pristine prints rarely surface at crime scenes. “Criminals are not nice enough to leave a perfect print someplace,” says Carriquiry.

After a Madrid train bombing in 2004, the FBI made a false arrest based on fingerprint evidence that claimed a “100 percent identification” — then a better match surfaced. (Credit: Associated Press)

One study published in 2018 reviewed proficiency tests that crime labs give their employees, asking them to match about a dozen fingerprints with known answers. A sample of annual tests given between 1995 and 2016 showed that 7 percent of participants missed at least one true match, and 7 percent incorrectly matched fingerprints from different people. And that was with relatively clean prints, not the kind recovered in actual criminal cases.

In real-world scenarios, examiners use partial, smudged, low-quality images to search for a match within databases of a few million subjects at state law enforcement agencies. National agency databases hold up to a hundred million options. Here, too, new algorithms could make a difference. In 2018, the U.S. Army Criminal Investigation Laboratory released FRStat, the first widely available software to calculate similarity scores for sets of fingerprints.

Stats on the Stand

As researchers improve forensic science, some trial lawyers feel hampered by recommended reforms that make little difference in practice. After all, forensic conclusions, like matching bullet grooves, are just one piece of the overall case. In Luehr’s words, “I just speak for the evidence. I’m not responsible for putting the gun in someone’s hand.” It’s up to prosecutors to review the totality of evidence, and when they believe a suspect is guilty, to present a full and factual narrative — one that places the suspect at the scene of the crime with a weapon in hand and motive in mind. If the rest of the story holds, does it matter if the chances of forensic error are 1 in 100 vs. 1 in a billion?

Latent fingerprints found at a crime scene are seldom pristine. These smudged and partial samples appeared in a 2011 study in Proceedings of the National Academy of Sciences. Five of the 169 examiners in that study made false-positive errors when comparing these prints. (Credit: PNAS study, May 2011, Accuracy and Reliability of Forensic Latent Fingerprint Decisions)

Seasoned prosecutor Matt Murphy doubts it. During his 26 years in California’s Orange County District Attorney’s Office, Murphy argued more than 200 criminal trials. He saw gut-wrenching cases of rape, serial murder and one kidnapping and torture involving a blowtorch, bleach and worse in the Mojave Desert. Lawyers often deal with nightmare realities that can make concerns about statistical algorithms seem like nitpicking from an ivory tower.

“Your goal is to achieve justice,” says Murphy. That begins, he adds, with prosecutors only charging cases backed by solid evidence. During his time as a DA, Murphy filled filing cabinets with crimes he declined to charge until more clues mounted. “You never read about all the cases that get refused,” he says.

Take the 1988 murder of Malinda Godfrey Gibbon, a pregnant newlywed who was brutally raped and stabbed to death with a kitchen knife in her Costa Mesa, California, home. Investigators found a fingerprint on Gibbon’s refrigerator, which matched prints in the national database belonging to a convicted felon. But it turned out the man worked on the assembly line in Tennessee where the refrigerator was manufactured. He had never set foot in California. The cold case sat over 15 years until DNA tied the murder to another man. Only then did Murphy bring the case to trial, resulting in a death sentence in 2014.

For the cases that do make it to court, prosecutors have to convince the jury of a suspect’s guilt. And lawyers know complex statistics don’t sway jurors. What does work: showing big, blown-up images of seemingly indistinguishable prints or DNA profiles from the suspect and crime scene.

“If [jurors are] looking at that, you can tell them whatever statistics you want. Maybe it’s helpful. Maybe it’s not. But I would certainly ask them to base their decisions more on what they see for themselves,” Murphy explains. “Who are you going to believe, the statistician or your own lying eyes?”

Inside a Juror’s Mind

Cemented through decades of arguing cases, Murphy’s views about stats on the stand are backed by data. CSAFE researchers have run mock-trial experiments in which hundreds of participants judge the same case with variable tweaks in the forensic testimony. Ironically, the studies show that testimonies that include CSAFE’s state-of-the-art statistics do not change jury opinions.

In part, that’s because most folks misunderstand mathematical probability. One experiment included a four-question quiz about basic statistics. Of the 1,450 participants, over 60 percent rated themselves as “somewhat good to extremely good with percentages.” Yet fewer than 2 percent of participants correctly answered the four questions.

But the experiments revealed another obstacle, deep in the minds of jurors. When it comes to fingerprints, people tend to trust a potential match. The type, strength and wording of statistics don’t matter. “They’ve been hearing for so many years that the evidence is unique or perfect or infallible,” says Brandon Garrett, a Duke University law professor who leads CSAFE research on lawyer and jury perceptions of forensic evidence.

Other info might shake their faith, though. “The conclusion wording isn’t as important as explaining to people that this evidence is fallible,” Garrett says. He saw jury opinions shift when expert witnesses stated under oath that prints can be misidentified. Skepticism also grew if jurors heard the expert scored less-than-perfect on a routine proficiency test administered in crime labs, according to a 2019 paper by Garrett, published in Behavioral Science & the Law.

The burden falls on lawyers to assess evidence reliability, and not just so they can suavely persuade jurors. The reality is that fewer than 10 percent of criminal cases result in a jury trial. The majority are negotiated by prosecutors, defenders and judges in plea deals. And evidence, such as fingerprints and bullet casings, figures prominently in these negotiations. To bring criminals to justice, attorneys and judges need to understand the statistics and science underlying forensic analyses.

But they don’t teach science in law school. Few lawyers understand how science works, according to Kaye, the law professor emeritus. He doesn’t mean technical details of particular methods, but rather a deeper understanding of how scientists conduct research and generate knowledge, which can challenge, update or overturn what we thought we knew.

“The law needs to get away from just saying, ‘Because things have been accepted in the past, they also continue to work,’ ” Kaye says.

There lies the deepest divide between science and the law: The legal system rests on precedent. Earlier judgments constrain future decisions. But science poses a healthy assault on previous thought, and has done so since the Enlightenment. CSAFE members and other academics will continue to rethink and revise yesterday’s forensics. It will be up to judges, lawyers and other gatekeepers, however, to decide whether that scientific progress ever has its day in court.