Hearing is Believing – Why We Don’t Use Google
Now that I have your attention, this is why we don’t use Google…except for select use cases. As a company that manages the AI-powered CX for more than 100 brands, we often get asked about the place of Google or Amazon in our conversational AI technology stack. More specifically, since our reputation is staked on doing voice automation for contact centers better than anyone, we get asked about our reliance on their flagship service: speech-to-text, otherwise known as Automatic Speech Recognition (ASR). We have all seen the growing trend of do-it-yourself (DIY) chatbot platforms, including some of the dominant CCaaS players, attempting to incorporate some of these APIs, though admittedly primarily for chat.
While we like some of the services they offer (e.g. multilingual, dynamic TTS, etc.) to augment our stack, ASR is not one of them, at least not yet, except for rarer, open-ended use cases that require a transcription-based engine. In those cases, we don’t just like Google, we LOVE Google. I know, I know, how is it possible for a company like Google, with its heritage and investment in speech recognition, not to be our “no-brainer” choice for every customer service interaction we automate?
Well, the answer comes down to one thing – telephony. The experience over telephony (low fidelity) is very different from the experience we’re accustomed to when speaking directly into a phone or home device (high fidelity). As soon as those sound waves travel across an outdated telephony infrastructure, not only is resolution reduced by more than half, but noise is introduced.
A developer could circumvent this problem if there were a way to (1) train their models with domain-specific, low fidelity inputs and (2) tune the engine to only listen for a limited range of customer-specific responses. However, that isn’t the case. These ASR engines are black-box services that can’t be cracked open by any developer to tune the experience.
These are transcription-based engines that have to be all things to all people, which means taking a statistical approach against every word or phrase under the sun that could be said…then transcribing whichever of a million candidate utterances achieves the highest confidence score. That works pretty well in high fidelity environments – straight from the voice to the device (as long as the device has an internet connection) – where their models have been trained. However, these are phone calls that run across low fidelity telephony infrastructure, which cuts the audio the model sees to less than half the original resolution of “on-device” capture. When you strip out all the highs and lows, accuracy is so impaired that even simple “yes” or “no” questions become a challenge to transcribe accurately. The caller needs to speak clearly and intentionally, or it will fail.
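To make that bandwidth loss concrete, here is a minimal Python sketch (purely illustrative, not our production pipeline, and the file names are placeholders) of what the phone network does to wideband audio: resample a 16 kHz “on-device” recording down to 8 kHz and strip everything outside the roughly 300-3400 Hz telephony passband.

```python
# Illustrative sketch: mimic what the phone network does to a wideband,
# "on-device" recording (assumes a mono WAV file).
import math

import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, lfilter, resample_poly

def simulate_telephony(path_in: str, path_out: str) -> None:
    rate, audio = wavfile.read(path_in)   # e.g. a 16 kHz wideband capture
    audio = audio.astype(np.float32)

    # Resample to 8 kHz: half (or less) of the original sample rate, which is
    # where "resolution reduced by more than half" comes from.
    g = math.gcd(8000, rate)
    narrowband = resample_poly(audio, up=8000 // g, down=rate // g)

    # Band-pass to the ~300-3400 Hz telephony range, discarding the highs and
    # lows that cloud ASR models were trained on.
    b, a = butter(4, [300, 3400], btype="band", fs=8000)
    filtered = lfilter(b, a, narrowband)

    wavfile.write(path_out, 8000, np.clip(filtered, -32768, 32767).astype(np.int16))

# Hypothetical file names: compare the two recordings by ear.
simulate_telephony("on_device.wav", "over_the_phone.wav")
```

Listening to the degraded file, or feeding it to any general-purpose transcription engine, makes the gap easy to hear.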
If you have a hard time believing how poor the experience is over telephony, just try Google’s voicemail transcription service on any Android phone. Those are real phone calls that came over low-resolution telephony and landed in your voicemail before Google could begin the transcription. You’ll find out very quickly how accurate that engine is compared to speaking directly into your phone or home device in high fidelity. The results diverge wildly.
To deliver the very best customer experience possible over voice, you have to take a machine learning approach that is purpose-built for this very problem and give developers the means to customize and tune for specific grammars or utterances, so the acoustic model of what the caller says can be weighted against the acoustic models of the expected outputs.
I’ll explain.
In customer service, most interactions or use cases have a narrow range of expected outputs or answers to every question. Good CX designers build their conversation flows to ensure this. If a developer knows the range of grammars that need to be accounted for on each question, the engine can be manually tuned to listen for those expected responses.
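As a simplified illustration (the structure below is hypothetical and not our actual configuration format), a per-question grammar can be as plain as a map from each prompt in the flow to the handful of intents, and phrasings, the engine should listen for:

```python
# Hypothetical per-question grammars: each prompt in the conversation flow
# declares the small set of intents, and the phrasings of each, that the
# recognizer should listen for.
GRAMMARS = {
    "confirm_appointment": {
        "yes": ["yes", "yeah", "yep", "correct", "that's right"],
        "no": ["no", "nope", "negative", "that's not right"],
    },
    "appointment_day": {
        day: [day]
        for day in ("monday", "tuesday", "wednesday", "thursday", "friday")
    },
}

def allowed_utterances(question_id: str) -> dict[str, list[str]]:
    """Return the intent -> expected phrasings map for one question."""
    return GRAMMARS[question_id]
```

Everything the recognizer scores for that question is drawn from that short, question-specific list rather than the entire dictionary.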
This approach is what makes all the difference in the world. Admittedly, it is not a very scalable approach. It’s also why eight weeks of development can sometimes be required to train and tune models for an interaction or set of grammars we’ve never supported before. However, it takes this level of customization across every interaction over the voice channel to really boost accuracy question by question.
In this case, our AI-brain isn’t listening for every utterance under the sun and then transcribing according to the highest statistical confidence against every word in the dictionary. We’re not trying to boil the ocean for every utterance that exists. Instead, it focuses on the limited subset of customer-specific responses we expect to hear in order to extrapolate intent.
For example, if it’s a “Yes” or “No” question, we can predict the only outcome is “Yes” or “No.” That means whatever comes from the customer’s mouth will be weighted against the output it most closely resembles. This is why we believe we offer the very best “yes” or “no” accuracy in the industry.
If one of the grammars we’re listening for is “cat” but the utterance sounds closer to “hat,” our engine will identify “cat” while a transcription-based engine would identify “hat,” resulting in failed containment.
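Here is a toy sketch of that weighting idea. A real engine compares acoustic or phonetic representations rather than character strings, but simple string similarity (via Python’s standard difflib) is enough to show how constraining the candidate set changes the answer; the helper and threshold below are hypothetical.

```python
import difflib

def match_to_grammar(heard, expected, threshold=0.5):
    """Map a noisy hypothesis onto the closest expected grammar, or None."""
    score, best = max(
        (difflib.SequenceMatcher(None, heard, g).ratio(), g) for g in expected
    )
    return best if score >= threshold else None

# Only "cat" is expected on this question, so a muffled "hat" resolves to "cat".
print(match_to_grammar("hat", ["cat"]))      # -> "cat"
# An unrelated utterance falls below the threshold and triggers a re-prompt.
print(match_to_grammar("banana", ["cat"]))   # -> None
```

Of course, if the expected set itself contained acoustically close entries, that kind of closeness matching would no longer discriminate between them, which is exactly the next scenario.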
If the expected set of grammars for a given question were along the lines of “cat,” “hat,” “bat,” or “sat,” we would opt to use Google for that interaction, since a transcription-based engine will outperform on grammars with similar acoustics.
This is not to say a domain-specific, rules-based engine is only suited to narrow-aperture use cases. It also significantly outperforms transcription-based engines on wide-aperture use cases BUT ONLY IF there is a pattern or record to match against. A good example is address capture, where we can match known street names as long as we capture the caller’s zip code first. The same is true for capturing vehicle make and model, since there is a database of expected grammars to reference. However, when a situation involves a wide aperture of responses with no database of outputs to pattern-match against, there is no element of prediction to hack the accuracy. In those cases, we opt for Google’s service, since a statistical, transcription-based approach will outperform every time.
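Sketching the address-capture example in the same toy style (the lookup table is invented; a real deployment would query a postal or address database keyed by zip code), the zip code captured first narrows the wide-aperture street question to a short candidate list the utterance can be weighted against.

```python
import difflib

# Invented lookup table; a real deployment would query an address database
# keyed by zip code.
STREETS_BY_ZIP = {
    "78701": ["Congress Avenue", "Guadalupe Street", "Lavaca Street"],
    "10001": ["8th Avenue", "West 34th Street", "Penn Plaza"],
}

def capture_street(zip_code: str, heard: str) -> str | None:
    """Weight the caller's utterance only against streets known in that zip."""
    candidates = STREETS_BY_ZIP.get(zip_code, [])
    if not candidates:
        # No record to pattern-match against: fall back to transcription.
        return None
    score, best = max(
        (difflib.SequenceMatcher(None, heard.lower(), c.lower()).ratio(), c)
        for c in candidates
    )
    return best if score > 0.6 else None

print(capture_street("78701", "guadaloop street"))   # -> "Guadalupe Street"
```

Take away the database and the element of prediction disappears with it.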
Unfortunately, there is no “easy button” to doing speech recognition well for customer service over telephony. It comes down to knowing the right tool for any given job that will deliver the very best customer experience possible.
We expect an eventual future where we can hit the “easy button” on all this, with Google, Amazon, or Microsoft leading against all other approaches. However, we have a hard time seeing wholesale improvements in the experience until there is a telephony infrastructure refresh that ensures high fidelity end-to-end without noise.
To see what a great experience should sound like, see Hearing Is Believing for OnDemand demos of real customer interactions with IVA.