Does Word Error Rate Matter?
Word Error Rate (WER) is a common metric for measuring speech-to-text accuracy of automatic speech recognition (ASR) systems. Microsoft claims to have a word error rate of 5.1%. Google boasts a WER of 4.9%. For comparison, human transcriptionists average a word error rate of 4%.
When comparing conversational AI solutions that automate interactions over telephony, is WER a good metric to gauge how well the virtual agent will understand you?
How to Calculate WER
Word Error Rate is a straightforward concept and simple to calculate – it’s basically the number of errors divided by the total number of words.
Word Error Rate = (Substitutions + Insertions + Deletions) / Number of Words Spoken
Where errors are:
- Substitution: when a word is replaced (for example, “shipping” is transcribed as “sipping”)
- Insertion: when a word is added that wasn’t said (for example, “hostess” is transcribed as “host is”)
- Deletion: when a word is omitted from the transcript (for example, “get it done” is transcribed as “get done”)
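The formula above can be computed with a word-level Levenshtein (edit) distance. Here is a minimal Python sketch — the function name and example are my own, not taken from any particular toolkit:

```python
def wer(reference, hypothesis):
    """Word Error Rate via word-level Levenshtein distance.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn the hypothesis into the reference,
    divided by the number of words in the reference.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("get it done", "get done"))  # one deletion → 1/3 ≈ 0.33
```

The deletion example from the list above scores 1/3: one error out of three spoken words.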
Variables that Affect WER
The issue with WER is that it does not account for the variables that impact speech recognition. For humans, the ability to distinguish between speech and background noise is fairly easy — if someone calls me from a concert, I can differentiate the speaker’s voice from the music that’s playing. But for machines, separating speech from background noise – even if it is music – is difficult to do.
In the absence of background noise, other factors significantly impact a machine’s ability to transcribe speech:
Accents and Homophones
Whether you realize it or not, you have an accent. In fact, everyone has an accent. The way we speak varies tremendously, even if we are native speakers of the same language. For example, I pronounce “aunt” like “ant” the insect. The American Heritage Dictionary also recognizes “aunt” — like “daunt” — as a correct pronunciation.
Understanding different accents and disambiguating homophones go beyond the capabilities of most, if not all ASR systems. Without contextual training or an NLU engine to correct the error, the sentence “I have ants in the kitchen” can be transcribed as “I have aunts in the kitchen.”
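To illustrate how context can resolve a homophone, here is a deliberately toy sketch: it picks whichever candidate word shares more "typical context" words with the rest of the sentence. The context lists are invented for illustration and bear no relation to how a production NLU engine is actually trained:

```python
# Toy homophone disambiguation. The context sets below are made up
# for this example only — a real NLU engine learns context from data.
CONTEXT = {
    "ants": {"kitchen", "crumbs", "bugs", "spray", "floor"},
    "aunts": {"family", "visit", "dinner", "cousins", "holiday"},
}

def disambiguate(sentence, candidates):
    """Return the candidate whose context words overlap the sentence most."""
    words = set(sentence.lower().split())
    return max(candidates, key=lambda c: len(CONTEXT[c] & words))

print(disambiguate("i have ___ in the kitchen", ["ants", "aunts"]))  # ants
```

Because "kitchen" appears in the sentence, the sketch prefers "ants" — the same kind of contextual signal an NLU engine uses at far greater scale.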
Crosstalk
When two people speak over each other, it's not too difficult to follow the conversation if you're in the room with them. But transcribing a recording of two people speaking at the same time is hard for humans, let alone for machines to get right.
How does the ASR know which voice to prioritize?
The answer – it doesn’t. Depending on the ASR system, different methods can be applied to handle crosstalk (one method is to omit a speaker’s words entirely) – all of which inevitably raise the WER.
Audio Quality
If a person is using a speakerphone, their distance from the microphone degrades audio quality and introduces ambient noise. If a person is calling through a landline or cell phone, the audio traverses a telephony network that compresses it to low fidelity, reducing it to a mere 8 kHz sampling rate. Since most people aren’t speaking in a vacuum, the audio quality in these real-world scenarios makes them less than ideal to transcribe.
Specialized Terminology
Accurately capturing technical or industry-specific terminology takes skill and effort. For this reason, human transcriptionists often specialize in transcribing for a particular field (e.g. legal or medical) and charge more for their services because of their subject matter knowledge. Similarly, speech recognition systems trained on general data will struggle with complex or industry-specific language because they lack that frame of reference.
WER is Flawed
When it comes to evaluating conversational AI solutions, keep in mind that the ASR is only one component of the technology stack, and WER is only one metric to evaluate ASRs — an imperfect one at that.
WER offers a myopic view of speech recognition because it only counts the errors and does not factor in the variables causing them. Moreover, it doesn’t consider that some words are more important than others. Every word (whether it’s an article, noun, or verb) is weighted equally, even if a single mistranscribed word reverses the meaning of the sentence.
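A quick way to see the equal-weighting problem: two transcripts with identical WER can differ wildly in how much damage the error does. For same-length transcripts, WER reduces to the fraction of mismatched words, which keeps this sketch short (the sentences are invented for illustration):

```python
ref   = "please do not cancel my order".split()
hyp_a = "please do now cancel my order".split()  # "not" → "now": intent reversed
hyp_b = "please uh not cancel my order".split()  # "do" → "uh": intent intact

# With equal-length transcripts, WER is just the fraction of mismatched words.
wer_a = sum(r != h for r, h in zip(ref, hyp_a)) / len(ref)
wer_b = sum(r != h for r, h in zip(ref, hyp_b)) / len(ref)
print(wer_a, wer_b)  # both 1/6 ≈ 0.17, but only hyp_a flips the meaning
```

Both hypotheses score one error in six words, yet one of them turns a plea to keep an order into a cancellation request. WER cannot tell them apart.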
WER as a Marketing Tool
Some of the very companies that boast low error rates also recognize that WER is not a good metric and that it “counts errors at the surface level.” What’s more critical to note is that many of these companies use the same corpus to train their models and test for accuracy. In other words, they’re achieving human-level accuracy because they are testing their ASR systems on the very same dataset used to train those systems.
So, take WER with a grain of salt — it’s more of a marketing gimmick than a true measure of accuracy.
Calls in the Wild
At SmartAction, we see that 20% of inbound calls are significantly impacted by noise — to the point where even a human would have trouble understanding. These “calls in the wild” are the true test for any ASR system.
To illustrate my point, check out this real-world conversation featuring a stranded driver calling AAA roadside assistance. Can you clearly hear and understand what the caller is trying to say?
On its own, even the best speech recognition engine will not transcribe the above conversation correctly. In fact, it didn’t. The ASR in the above example transcribed very literally what it heard: because the “F” in “Ford F250” was barely audible, it only heard “Ord” and then transcribed “Aboard” as the closest word match. Herein lies the problem with even the best speech recognition engines – they are never 100% right. In fact, they are wrong quite often.
But as you heard, the AI correctly read back “Ford F250.” How did it do that? Because the ASR was backed by a Natural Language Understanding engine that knew what it was listening for — in this case, vehicle makes and models. When the ASR transcription didn’t match an expected output, the NLU engine kicked in to compare the acoustic and language model outputs against what it expected to hear. The closest match was the correct one – “Ford F250.”
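One simple way to approximate this "expected output" matching is fuzzy string comparison against the phrases the dialog is listening for. This sketch uses Python's standard-library difflib; the vehicle list and function are illustrative stand-ins, not SmartAction's actual implementation:

```python
import difflib

# Hypothetical list of phrases the dialog step is listening for.
EXPECTED_MODELS = ["ford f250", "ford f150", "honda civic", "toyota camry"]

def recover(asr_text, expected):
    """Fuzzy-match a raw ASR transcript against the expected phrases.

    Returns the closest expected phrase above a similarity cutoff,
    or None if nothing is close enough.
    """
    matches = difflib.get_close_matches(asr_text.lower(), expected,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else None

print(recover("Aboard 250", EXPECTED_MODELS))  # ford f250
```

Even though the ASR heard “Aboard 250,” the shared characters are enough to recover “ford f250” from the expected list — a crude stand-in for what a real NLU engine does with acoustic and language model scores.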
If your ASR isn’t backed by an NLU engine that developers have tailored to the specifics of your interactions across every question and answer, your ASR will not meet your customers’ expectations, no matter what the claimed WER might be.
When it comes to Word Error Rate and Automatic Speech Recognition, here are a few things to remember:
- WER is one metric to use – and by no means the only metric you should use.
- ASR systems have come a long way but are still far from perfect.
- The secret to great speech technology is not the ASR itself, but rather the associated NLU engine that augments accuracy.