Hearing is Believing – Why We Don’t Use Google

Brian Morin
July 7, 2020

Now that I have your attention, this is why we don’t use Google…except for select use cases. As a company that manages the AI-powered CX for more than 100 brands, we often get asked about the place of Google or Amazon in our conversational AI technology stack. More specifically, since our reputation is staked on doing voice automation for contact centers better than anyone, we get asked about our reliance on their flagship service: speech-to-text, otherwise known as Advanced Speech Recognition (ASR). We have all seen the growing trend of do-it-yourself (DIY) chatbot platforms attempting to incorporate some of these APIs, including some of the dominant CCaaS players, though admittedly primarily for chat.

While we like some of the services they offer (e.g., multilingual support and dynamic TTS) to augment our stack, ASR is not one of them, at least not yet, except for the rarer, open-ended use cases that require a transcription-based engine. In those cases, we don’t just like Google, we LOVE Google. I know, I know: how is it possible for a company like Google, with its heritage and investment in speech recognition, not to be our “no-brainer” choice for every customer service interaction we automate?

Well, the answer comes down to one thing – telephony. The experience over telephony (low fidelity) is very different from the experience we’re accustomed to when speaking directly into a phone or home device (high fidelity). As soon as those sound waves travel across an outdated telephony infrastructure, not only is resolution reduced by more than half, but noise is introduced.
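
To make the fidelity gap concrete, here is a minimal sketch (not our production pipeline) that uses SciPy to approximate what a traditional narrowband phone line does to 16 kHz “on-device” audio: band-limiting to roughly 300–3400 Hz and halving the sample rate to 8 kHz.

# Minimal sketch: simulate narrowband telephony from wideband audio.
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def simulate_telephony(audio_16k, sr=16000):
    """Approximate what a narrowband phone line leaves of a wideband recording."""
    # Band-pass to the ~300-3400 Hz range a traditional PSTN channel carries.
    sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
    band_limited = sosfiltfilt(sos, audio_16k)
    # Downsample 16 kHz -> 8 kHz: half the sample rate, half the usable bandwidth.
    return resample_poly(band_limited, up=1, down=2)

# Example: one second of synthetic audio with a 200 Hz and a 5 kHz component;
# both fall outside the telephony passband and are largely stripped away.
t = np.linspace(0, 1, 16000, endpoint=False)
wideband = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 5000 * t)
narrowband = simulate_telephony(wideband)
print(len(wideband), len(narrowband))  # 16000 samples in, 8000 out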

A developer could circumvent this problem if there were a way to (1) train their models with domain-specific, low fidelity inputs and (2) tune the engine to listen only for a limited range of customer-specific responses. However, that isn’t the case. These ASR engines are black-box services that can’t be cracked open by any developer to tune the experience.

These are transcription-based engines that have to be all things to all people, which means taking a statistical approach against every word or phrase under the sun that could be said…then transcribing whichever of a million candidate utterances achieves the highest confidence score. That works pretty well in high fidelity environments – straight from the voice to the device (as long as the device has an internet connection) – where their models have been trained. However, these are phone calls that run across low fidelity telephony infrastructure and reduce wave-to-image mapping to less than half the original resolution of “on-device” audio. When you strip out all the highs and lows, accuracy is so impaired that even simple “yes” or “no” questions become a challenge to accurately transcribe. The caller needs to speak clearly and intentionally, or it will fail.
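
A toy illustration of that last point, with invented numbers: a general-purpose engine returns an n-best list and simply commits to the hypothesis with the highest confidence score, and over noisy narrowband audio the scores bunch together closely enough that even a “yes” can lose.

# Toy illustration (invented numbers): an open-vocabulary, transcription-based
# engine scores many candidate transcripts and commits to the top one.
nbest = [
    {"transcript": "yes", "confidence": 0.41},
    {"transcript": "yet", "confidence": 0.43},  # narrowband noise pushed this ahead
    {"transcript": "us", "confidence": 0.16},
]
best = max(nbest, key=lambda h: h["confidence"])
print(best["transcript"])  # "yet" -- a failed yes/no capture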

If you have a hard time believing how poor the experience is over telephony, just try Google’s voicemail transcription service on any Android phone. Those are real phone calls that came over low-resolution telephony and landed in your voicemail before Google could begin the transcription. You’ll find out very quickly how accurate that engine is in comparison to speaking directly into your phone or home device in high fidelity. The results are wildly divergent.

To deliver the very best customer experience possible over voice, you have to take a machine learning approach that is purpose-built for this very problem and give developers the means to customize and tune for specific grammars or utterances, so the acoustic model of the inputs can be weighted against the acoustic models of the expected outputs.

I’ll explain.

In customer service, most interactions or use cases have a narrow range of expected outputs or answers to every question. Good CX designers will design their conversation flows in such a way as to ensure this. If a developer knows the range of grammars that need to be accounted for against each question, the engine can be manually tuned to listen for the expected responses.
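
As a minimal sketch of that idea (using difflib’s string similarity as a crude stand-in for a real acoustic or phonetic similarity model), whatever is heard is scored against only the responses this question can produce, and anything below a threshold triggers a reprompt:

# Minimal sketch of constrained matching: score the heard utterance against
# only the expected grammars for this question. difflib stands in for a real
# acoustic/phonetic similarity model, which is a deliberate oversimplification.
from difflib import SequenceMatcher

def match_expected(heard, expected_grammars, threshold=0.5):
    """Return the expected response the heard utterance most closely resembles."""
    scored = [(SequenceMatcher(None, heard.lower(), g.lower()).ratio(), g)
              for g in expected_grammars]
    score, grammar = max(scored)
    return grammar if score >= threshold else None  # None -> reprompt the caller

print(match_expected("yeah", ["yes", "no"]))  # -> "yes" (the yes/no case below)
print(match_expected("hat", ["cat", "dog"]))  # -> "cat" (the cat/hat case below)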

This approach is what makes all the difference in the world. Admittedly, it is not very scalable. It’s also why eight weeks of development can sometimes be required to train and tune models for an interaction or set of grammars we’ve never supported before. However, it takes this level of customization across every interaction over the voice channel to really boost accuracy, question by question.

In this case, our AI-brain isn’t listening for every utterance under the sun and then transcribing according to the highest statistical confidence against every word in the dictionary. We’re not trying to boil the ocean for every utterance that exists. It is focused on the limited subset of customer-specific responses we expect to hear in order to extrapolate intent.

For example, if it’s a “Yes” or “No” question, we can predict the only outcome is “Yes” or “No.” That means whatever comes from the customer’s mouth will be weighted against the output it most closely resembles. This is why we believe we offer the very best “yes” or “no” accuracy in the industry.

If one of the grammars we’re listening for is “cat” but the utterance sounds closer to “hat,” our engine will identify “cat” while a transcription-based engine would identify “hat,” resulting in failed containment.

If the expected set of grammars to a given question were along the lines of “cat,” “hat,” “bat,” or “sat,” we would opt to use Google for that interaction since a transcription-based engine will outperform on grammars with similar acoustics.

This is not to say a domain-specific, rules-based engine works best only on narrow aperture use cases. It also significantly outperforms transcription-based engines on wide aperture use cases BUT ONLY IF there is a pattern or record to match against. A good example is address capture, where we can match known street names as long as we capture the caller’s zip code first. The same is true for capturing vehicle make and model, since there is a database of expected grammars to reference. However, when a situation involves a wide aperture of responses with no database of outputs to pattern match against, there is no element of prediction to hack the accuracy. In those cases, we opt for Google’s service, since a statistical, transcription-based approach will outperform every time.
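
A hedged sketch of the address capture example (the zip-to-street table here is invented; in practice it would come from a postal or GIS database): capturing the zip code first shrinks the search space from every street in the country to a handful of candidates the heard utterance can be matched against.

# Hypothetical illustration of address capture constrained by zip code.
from difflib import get_close_matches

STREETS_BY_ZIP = {  # made-up lookup data; a real system would query a postal/GIS database
    "90245": ["Main Street", "Maple Avenue", "Grand Avenue"],
    "10001": ["Broadway", "Fifth Avenue", "Madison Avenue"],
}

def capture_street(zip_code, heard):
    """Match a possibly garbled street utterance against known streets for the zip."""
    candidates = STREETS_BY_ZIP.get(zip_code, [])
    by_lower = {c.lower(): c for c in candidates}
    matches = get_close_matches(heard.lower(), list(by_lower), n=1, cutoff=0.6)
    return by_lower[matches[0]] if matches else None  # None -> fall back or reprompt

print(capture_street("90245", "maple avenew"))  # -> "Maple Avenue"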

Unfortunately, there is no “easy button” to doing speech recognition well for customer service over telephony. It comes down to knowing the right tool for any given job that will deliver the very best customer experience possible.

We expect an eventual future where we can hit the “easy button” on all this, with Google, Amazon, or Microsoft leading against all other approaches. However, we have a hard time seeing wholesale improvements in the experience until there is a telephony infrastructure refresh that ensures high fidelity end-to-end without noise.

To hear what a great experience should sound like, see Hearing Is Believing for on-demand demos of real customer interactions with IVA.

