Hearing is Believing – Why We Don’t Use Google
As a company that manages the AI-powered CX for more than 100 brands, we often get asked about the place of Google or Amazon in our conversational AI technology stack. More specifically, since we have staked our reputation on doing voice automation better than anyone, we get asked why we haven’t yet incorporated their flagship service: speech-to-text, otherwise known as Automatic Speech Recognition (ASR). We have all seen the growing trend of Contact Center as a Service (CCaaS) platforms integrating with these services to offer DIY self-service capabilities.
While we like some of the services they offer (e.g. multilingual, dynamic TTS, etc.) to augment our stack, ASR is not one of them – at least not yet. We know. We could hardly believe it either. Ironically, we had assumed their ASR would be at the very top of our adoption list.
After all, no one has invested more in bringing speech recognition to consumers on their devices than Amazon or Google. Doesn’t that heritage make them the incumbent, “no-brainer” choice for speech-to-text in the contact center?
So that’s what we thought too… until we tried it.
The experience over telephony (low fidelity) is very different from the experience we’re accustomed to when speaking directly into a device (high fidelity).
Theoretically, a developer could circumvent the problem if there were a way to train these models on low fidelity inputs and tune the engine to listen only for a limited range of customer-specific responses. However, that isn’t the case. These ASR engines are black-box services that can’t be trained, tuned, or improved in any way by the end user. If your self-service application doesn’t deliver a great experience, there is nothing you can do as a speech developer to fix it.
These are statistics-based engines that listen for every word or phrase under the sun, then attempt to transcribe whichever of a million candidate utterances achieves the highest confidence score. That works pretty well in high fidelity environments – straight from the voice to the device (as long as the device has an internet connection) – where their models have been trained. However, these are phone calls that run across low fidelity telephony infrastructure, which reduces the audio to narrowband resolution – typically 8 kHz, half or less of the sampling rate of “on device” audio. That strips out the highs and lows their models were trained on.
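The bandwidth loss described above can be sketched numerically. The rates below are common industry defaults (16 kHz wideband capture on a device, 8 kHz narrowband over classic telephony), not figures from any particular vendor’s stack; the code is a minimal illustration of the Nyquist limit, not a real transcoding pipeline:

```python
import numpy as np

# A 16 kHz "recording" containing a 1 kHz tone (inside the telephony
# passband) and a 6 kHz tone (above the 4 kHz narrowband limit).
fs_device, fs_phone = 16_000, 8_000
t = np.arange(fs_device) / fs_device                 # one second of audio
signal = np.sin(2 * np.pi * 1_000 * t) + np.sin(2 * np.pi * 6_000 * t)

# Crude band-limiting + 2:1 decimation, standing in for the phone
# network carrying the call as an 8 kHz narrowband stream.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / fs_device)
spectrum[freqs >= fs_phone / 2] = 0                  # drop everything >= 4 kHz
narrowband = np.fft.irfft(spectrum)[::2]             # now sampled at 8 kHz

# The 6 kHz component is simply gone from the telephony version.
nb_spectrum = np.abs(np.fft.rfft(narrowband))
nb_freqs = np.fft.rfftfreq(len(narrowband), 1 / fs_phone)
surviving = nb_freqs[nb_spectrum > len(narrowband) / 4]
print("Tones surviving the phone network:", surviving)  # only 1000 Hz remains
```

An ASR model trained on the full 16 kHz spectrum never sees the upper half of that spectrum on a phone call, which is exactly the mismatch described above.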
If you have a hard time believing how poor the experience is over telephony, just try Google’s voicemail transcription service on any Android. Those are real phone calls that came over low-resolution telephony and landed in your voicemail before Google could begin the transcription. You’ll find out very quickly how accurate that engine is in comparison to speaking directly into your phone or home device in high fidelity. The results diverge wildly.
Even the very simplest utterances, like “yes” or “no,” can be a real struggle for these ASR services over telephony. They have to be spoken clearly and deliberately or recognition will fail.
The difference with our own proprietary speech recognition is that we’re not listening for every utterance under the sun and then transcribing according to statistical confidence against every word in the dictionary. In customer service, you’re not trying to boil the ocean of every utterance that exists. You’re listening for a limited subset of customer-specific responses from which to extrapolate intent.
For that reason, we take a different approach that’s built and tuned against customer-specific criteria. In the example above, we know the only outcome is “yes” or “no.” That means we’re ONLY listening for the closest thing that resembles a “yes” or a “no.” There is nearly always an identifiable pattern to listen for, and it takes that level of customization across every interaction on the voice channel to really boost accuracy.
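As a rough sketch of that constrained-matching idea (not Interactions’ actual engine), one can score a noisy transcript only against the handful of responses a given dialog step accepts – here a hypothetical yes/no grammar, using simple string similarity in place of a real acoustic model:

```python
import difflib

# Hypothetical per-step grammar: the only responses this dialog
# turn can accept, plus common variants of each.
EXPECTED = {
    "yes": ["yes", "yeah", "yep", "sure", "correct"],
    "no":  ["no", "nope", "nah", "negative", "incorrect"],
}

def classify(utterance: str, threshold: float = 0.6):
    """Map a (possibly garbled) utterance to 'yes', 'no', or None."""
    best_intent, best_score = None, threshold
    for intent, variants in EXPECTED.items():
        for variant in variants:
            # Score only against the small expected set, never the
            # whole dictionary.
            score = difflib.SequenceMatcher(
                None, utterance.lower(), variant).ratio()
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent

print(classify("yeah"))    # 'yes'
print(classify("nope"))    # 'no'
print(classify("banana"))  # None – reject out-of-grammar input
```

The design choice is the point: because the search space is a handful of expected responses rather than an open vocabulary, a degraded telephony signal can still resolve to the right intent, and anything that matches nothing can be re-prompted instead of mis-transcribed.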
Moreover, all the inputs that have gone into training our language models are based on actual phone calls – low fidelity audio. Unfortunately, there is no “easy button” for doing speech recognition well for customer service over telephony. At least, not yet.
We expect Google, Amazon, or Microsoft to lead in this area eventually, at which point we couldn’t be happier to introduce their ASR into our stack. However, we have a hard time seeing wholesale improvements in the experience until there is a telephony infrastructure refresh that ensures high fidelity end to end.
To see what a great experience should sound like, visit Hearing Is Believing for OnDemand demos of real customer interactions with IVA.