
OpenAI’s transcription tool Whisper makes up words patients have never said

The hallucination problem is especially worrisome as the tool becomes more widely used in medical centers, experts say.

Illustration: Anna Kim, Photos: Adobe Stock


AI scribes were supposed to make doctors’ lives easier, but many researchers are finding inaccuracies in their transcriptions.

A popular transcription tool used in medical centers, OpenAI’s Whisper model, is embellishing transcripts with dialogue that doesn’t appear in audio recordings, according to a study presented at the 2024 ACM Conference on Fairness, Accountability, and Transparency.

Whisper is used by around 30,000 clinicians and 40 health systems to document interactions with patients, according to the Associated Press. But multiple studies have shown that Whisper produced fabricated or inaccurate transcriptions when it was tested on real-world audio like town hall recordings.

"We take this issue seriously and are continually working to improve the accuracy of our models, including how we can reduce hallucinations,” OpenAI spokesperson Taya Christianson told Healthcare Brew in an emailed statement.

Allison Koenecke, an assistant professor of information science at Cornell University and an author on the study, told Healthcare Brew that this analysis, along with other research projects she has conducted, highlights how AI transcription tools underperform when tested on speech from people with accents, speech disfluencies, or language disorders. In particular, Whisper inserted made-up statements that “include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority,” she and her colleagues wrote in the study. In healthcare settings, such errors could misrepresent patients with speech irregularities and result in inaccurate assessment records.

The researchers ran audio data from Carnegie Mellon’s AphasiaBank, which captures interactions with people who have speech disfluencies, through Whisper. They divided the AphasiaBank audio into 13,140 segments, each 10 seconds long. After reviewing the output manually, they found that 187 segments, about 1% of test cases, contained hallucinations, meaning text output that is nonsensical or unfaithful to the audio input. These hallucinations did not occur in the other commercial speech-to-text models the researchers audited, from Google, Amazon, AssemblyAI, and RevAI.
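For readers curious what that kind of pipeline looks like in practice, here is a minimal sketch, not the study’s actual code, that splits a recording into 10-second segments and transcribes each one. It assumes the open-source whisper and pydub packages; the file names are hypothetical.

```python
# Minimal sketch: split a recording into 10-second segments and transcribe each.
# Assumes `pip install openai-whisper pydub` (plus ffmpeg); "clinic_visit.wav" is hypothetical.
import whisper
from pydub import AudioSegment

model = whisper.load_model("base")                 # small open-source Whisper checkpoint
audio = AudioSegment.from_wav("clinic_visit.wav")  # pydub indexes audio in milliseconds

SEGMENT_MS = 10_000                                # 10-second segments, as in the study
transcripts = []
for start in range(0, len(audio), SEGMENT_MS):
    chunk = audio[start:start + SEGMENT_MS]
    chunk.export("chunk.wav", format="wav")        # transcribe() takes a file path
    result = model.transcribe("chunk.wav")
    transcripts.append(result["text"])

# In the study, hallucination checks were manual: reviewers compared each
# transcript segment against the corresponding audio.
print(transcripts[:3])
```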

What can be done?

The study found that silences at the beginning and end of an audio file seemed to directly trigger hallucinations. Trimming those silences, for example by adjusting the decibel threshold through the Whisper API, could reduce hallucinations, Koenecke said.
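As an illustration of the general idea, and not the researchers’ setup, here is a minimal sketch that trims leading and trailing silence with a decibel threshold before handing the audio to a transcription model. The library choice (librosa), threshold value, and file names are assumptions.

```python
# Minimal sketch: trim leading/trailing silence before transcription.
# Assumes `pip install librosa soundfile openai-whisper`; file names are hypothetical.
import librosa
import soundfile as sf
import whisper

y, sr = librosa.load("patient_visit.wav", sr=16000)

# top_db=30: edge audio more than 30 dB below the peak is treated as silence.
trimmed, _ = librosa.effects.trim(y, top_db=30)
sf.write("patient_visit_trimmed.wav", trimmed, sr)

model = whisper.load_model("base")
result = model.transcribe("patient_visit_trimmed.wav")
print(result["text"])
```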

But most importantly, these AI tools need to be tested by outside auditors on diverse datasets that include non-native English speakers and people with speech disorders or accents, she said. She added that auditing can be time- and resource-intensive, since researchers have to find new testing data the model has never seen before.


“If you ask [AI] questions that are slightly outside of its domain of knowledge or questions that are slightly rephrased, you might end up seeing more failure modes that are more indicative of how the machine might behave in a real-world scenario,” Koenecke said.
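In practice, part of that outside auditing comes down to scoring a model’s output against human reference transcripts on data the model hasn’t seen. As a rough, hypothetical illustration, and not the study’s methodology, word error rate can flag segments where a model inserted words that aren’t in the reference; this sketch assumes the jiwer package, and the transcripts are invented.

```python
# Minimal sketch: compare model transcripts against human references with
# word error rate (WER). Assumes `pip install jiwer`; transcripts are hypothetical.
from jiwer import wer

references = [
    "i went to the pharmacy on tuesday",
    "my knee has been hurting since last week",
]
hypotheses = [
    "i went to the pharmacy on tuesday",
    "my knee has been hurting since last week and i stopped my medication",  # inserted words
]

for ref, hyp in zip(references, hypotheses):
    score = wer(ref, hyp)  # 0.0 is a perfect match; insertions raise the score
    print(f"WER={score:.2f}  |  {hyp}")
```

Automated scores like this only surface candidates; in the study, hallucinations were confirmed by manual review.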

Zooming out

As third-party AI tools become increasingly integrated into existing processes, technologies, and practices, there are growing concerns around consent, disclosure, and privacy, Vardit Ravitsky, president of the Hastings Center, a nonpartisan bioethics research institute, told Healthcare Brew.

There aren’t many guidelines or regulations around how AI is used in clinical decision support systems or for ambient documentation, said Kellie Owens, an assistant professor of population health at NYU Grossman School of Medicine. She believes individual institutions should monitor the safety and efficacy of these tools.

“Health systems are trying to debate internally how they’re going to handle the risks and benefits of this kind of technology that, on some levels, they have very little control over because they are using tools that they’re not building themselves,” she said.

Several state legislatures, including Colorado’s in May, are weighing policy and regulatory proposals on how to govern the use of AI in healthcare. Ravitsky said one emerging trend related to AI in healthcare is the notion of keeping a “human in the loop,” that is, not allowing these devices or generative tools, whether used for recording clinical notes or for prognostics, to operate without provider oversight. “We don’t consider them to be autonomous,” she said. “It’s premature.”

Then there are issues around accountability and liability that could ultimately affect the trust patients have in the medical system. For example, someone must determine who will be held accountable when medical harm results from the use of AI, Ravitsky said. Is it the company that made the model, or is it the hospital or the individual clinician who used the tech and perhaps missed something when reviewing their notes?

“We don’t have clarity, either legally or ethically, on where the responsibility lies for harm, mistakes,” she said.
