Using voice assistants in business applications

Apr 08, 2020 | Hlib Teteryatnikov

Introduction

While voice assistants are very popular in consumer applications, the enterprise market is only starting to explore the possibilities of contactless interaction with end users through voice-based technologies.

In this blog post, we are going to show the difference between simply using Amazon Alexa and implementing a custom voice assistant with the AWS Alexa SDK and other tools.

What's a voice assistant?

A voice assistant is a piece of software that uses commodity or specialized hardware, speech recognition, natural language processing and speech synthesis to replace traditional controls such as the mouse, keyboard and touchscreen with voice commands. Right now the consumer market for voice assistants is dominated by three tech giants: Amazon, Google and Apple. Each of these companies offers easy-to-use devices that enable end users to control multimedia devices, listen to the latest news and even talk to their smart home devices.

On the other hand, businesses are only beginning to adopt voice-based controls. There are many reasons for that, including security aspects, mass-deployment issues and speech recognition limitations in the presence of noise or other sound distortions, like echo.

We believe that this is a promising technology, especially when used in combination with other human-computer interaction methods. In this blog post we are going to explore the possibilities of Amazon Alexa, the AWS Alexa SDK and other libraries.

Possible Implementation Approaches

We'll go through two possible approaches to implementing voice control in business applications. First, we can use Amazon Alexa or Alexa for Business as a ready-to-use hardware/software solution. Alternatively, we can use the AWS Alexa SDK or another SDK to develop a custom voice assistant tailored to our needs. Both approaches have their own pros and cons.

Alexa Pros

  • A rich selection of ready-to-use skills provided by third parties. Alexa can be integrated with many smart home automation devices and software platforms.
  • A simple mechanism for creating new Alexa skills. Basically, a skill is an AWS Lambda function (see the sketch after this list).
  • It’s possible to create a skill using virtually any programming language.
  • It’s very easy to build and debug new skills using AWS tooling. For example, an AWS Lambda function can be created and debugged in AWS Cloud9, while logs can be viewed in AWS CloudWatch.
  • Customizable error, fallback and help messages.
  • Multiple skills per device. You can write many skills and use them on the same Alexa device. For example, you can write your own “Geography guru” skill to answer geography questions, and a separate documentation skill that answers questions about your existing project.
  • Simple and quick delivery of updates. Since each skill is an AWS Lambda function, updating the skill is as simple as publishing a new version of the AWS Lambda function.
  • Integrates well with existing AWS infrastructure. Like any other AWS Lambda function, an Alexa skill can be configured to have access to any AWS service or resource.
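
To illustrate the point about skills being AWS Lambda functions, here is a minimal sketch built with the ask_sdk_core package (the Alexa Skills Kit SDK for Python). The “OpenDoorIntent” name is a hypothetical intent that would be defined in the skill’s interaction model, not something from our actual project.

# Minimal sketch of an Alexa skill backend deployed as an AWS Lambda function,
# using ask_sdk_core (the Alexa Skills Kit SDK for Python).
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type, is_intent_name


class LaunchRequestHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        return handler_input.response_builder.speak(
            "Welcome to Home Automation. What should I do?").response


class OpenDoorIntentHandler(AbstractRequestHandler):
    # "OpenDoorIntent" is a hypothetical intent name from the interaction model.
    def can_handle(self, handler_input):
        return is_intent_name("OpenDoorIntent")(handler_input)

    def handle(self, handler_input):
        # Here we would call our backend/IoT service to actually open the door.
        return handler_input.response_builder.speak(
            "OK, I am opening the door.").response


sb = SkillBuilder()
sb.add_request_handler(LaunchRequestHandler())
sb.add_request_handler(OpenDoorIntentHandler())

# AWS Lambda entry point
lambda_handler = sb.lambda_handler()

Publishing a new version of this Lambda function is all it takes to update the skill, which is what makes delivery of updates so quick.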

Alexa Cons

  • At the moment it’s impossible to restrict which skills are available to users. This can be confusing and not very productive.
  • You need to say the skill name to invoke your custom logic. Example: you created a “Home Automation” custom skill; to perform the action “open the door” you need to say “ask Home Automation to open the door”. Name-free interaction has been in beta for a while and still can’t be relied on.
  • A limited choice of “wake words” without a way to supply your own.

There is also Alexa for Business, a deployment option of Alexa for organizations, focused on mass deployment to meeting rooms and broader use of Alexa in offices.

Custom Hardware Solution Pros

  • Flexibility. You can change and control everything at every step, including custom wake words and name-free interaction.
  • A wide range of hardware and software that can be used. For example, in our case we are using Raspberry Pi + ReSpeaker.
  • Complete control over what’s available to end users. You can provide only the actions/services you need, without any generic tools like built-in skills.

Custom Hardware Solution Cons

  • Implementation complexity and costs
  • Hardware and software update management becomes your responsibility

Selected approach description

In this blog post we are creating a custom voice assistant based on the ReSpeaker 4-Mic Array for Raspberry Pi. Our goal is to determine how to add voice assistant capabilities to existing hardware with minor modifications (like adding a microphone to the existing setup).

ReSpeaker 4-Mic Array for Raspberry Pi is a 4 microphone expansion board for Raspberry Pi designed for AI and voice applications. This means that we can build a more powerful and flexible voice product that integrates Amazon Alexa Voice Service, Google Assistant, and so on.

First of all, we want to use our own wake words, mainly for user experience and marketing purposes. Additionally, we need more control over the actions available to the user and want to restrict the usage of irrelevant or unneeded skills.

Code Samples

We'll be working with two different libraries. One is the SpeechRecognition Python library, which is a wrapper around multiple APIs and is rather flexible. In the example below we will use the Google Web Speech API. A default API key that is hard-coded into the SpeechRecognition library is used in the example, so no additional configuration is needed. We will also check the Vosk library, which has offline capabilities.

SpeechRecognition Example

Before using SpeechRecognition with a real microphone we need to install some dependencies. To access your microphone with SpeechRecognition, you’ll have to install the PyAudio package. The process for installing PyAudio depends on your operating system.

SpeechRecognition supports several engines and APIs, online and offline, and works with Python 2.7 and 3.3+. With the help of this library, we can also show the recognized speech on a display if the Raspberry Pi has one.

Workflow Example

Person: Open the door.
Assistant (if the recognition confidence is too low): You said: open the door?
Person: Yes / Sure.
Assistant (processes the user’s request): OK, I am opening the door.

Here is a related code example:
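
The snippet below is a minimal sketch of that flow, assuming the speech_recognition and PyAudio packages are installed; the “open the door” check simply stands in for the real request handling.

# Minimal sketch of microphone capture and recognition with the
# SpeechRecognition library and the Google Web Speech API.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    # Calibrate for ambient noise (helps with echo and background sounds).
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("Listening...")
    audio = recognizer.listen(source)  # records until a pause is detected

try:
    # recognize_google() falls back to a default API key hard-coded in the library.
    text = recognizer.recognize_google(audio)
    print("You said: " + text)
    if "open the door" in text.lower():
        # Here we would trigger the actual action, e.g. call a door controller API.
        print("OK, I am opening the door.")
except sr.UnknownValueError:
    print("Sorry, I could not understand the audio.")
except sr.RequestError as e:
    print("Could not reach the speech recognition service: {0}".format(e))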

In this example, we are capturing input from the microphone using the listen() method of the Recognizer class. This method takes an audio source as its first argument and records input from that source until a pause is detected. The resulting fragment is then processed by the underlying speech recognition library/API.

Result:
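
For the sketch above, a successful run would print output along these lines (illustrative, not captured from a real session):

Listening...
You said: open the door
OK, I am opening the door.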

This is a very simple example of how we can implement our own voice assistant. To improve our solution and let the tool answer with voice, we can use a speech synthesis library or API, for example Amazon Polly.

It's very important to remember that speech recognition is only half of a voice assistant. Once the request is processed, proper voice feedback is needed.
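
As a rough illustration, such voice feedback could be generated with Amazon Polly through boto3. This is only a sketch: the voice id, output file name and playback command are arbitrary choices, and configured AWS credentials are assumed.

# Sketch: synthesizing the assistant's spoken reply with Amazon Polly via boto3.
import boto3

polly = boto3.client("polly")
response = polly.synthesize_speech(
    Text="OK, I am opening the door.",
    OutputFormat="mp3",
    VoiceId="Joanna",  # arbitrary voice choice
)

# Save the synthesized audio and play it back with any available player,
# e.g. `mpg123 reply.mp3` on the Raspberry Pi.
with open("reply.mp3", "wb") as f:
    f.write(response["AudioStream"].read())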

Vosk-API Example

Vosk-API is a language binding for Vosk and Kaldi that makes speech recognition accessible from various languages and on various platforms; we will use its Python wrapper to test how the recognition works. Kaldi is a speech recognition toolkit written in C++.

Note: after cloning the code from GitHub and performing all required steps, the code may fail with the following error: "Input overflowed". This can happen due to the limited computational power of the device running the code. The quick and dirty workaround is to disable buffer overflow exceptions by passing exception_on_overflow=False as the second parameter to the stream.read() method. Here is the code example:
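
Below is a minimal sketch of such a recognition loop, assuming the vosk and pyaudio packages are installed and a Vosk model has been downloaded and unpacked into a local "model" directory.

# Sketch of fully offline recognition with Vosk and PyAudio.
import json
import pyaudio
from vosk import Model, KaldiRecognizer

model = Model("model")                 # path to the downloaded Vosk model
recognizer = KaldiRecognizer(model, 16000)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=8000)
stream.start_stream()

print("Listening...")
while True:
    # exception_on_overflow=False is the workaround for the
    # "Input overflowed" error mentioned above.
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())
        print("You said: " + result.get("text", ""))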

A big advantage of this library is that it works fully offline. The recognition quality depends a lot on pronunciation (to get a decent recognition level, I had to pronounce words very clearly) and recognition takes some time (for simple phrases like “open the door” it took more than 2.5 seconds). Both the recognition success rate and the speed depend on the selected model, the hardware used and other factors, like background noise and echo.

Photos

Here is a photo of my setup. You can see the ReSpeaker on the left and Amazon Alexa on the right.


Testing Results

We’ve tested three voice assistants: Amazon Alexa, the ReSpeaker with the SpeechRecognition library (Custom Hardware 1) and the ReSpeaker with the Vosk-API library (Custom Hardware 2). We tried to make these tests as unbiased as possible, so we used two sources of voice: Amazon Polly and a natural human voice. We also ran the tests at two typical distances: 1 m and 3 m. Here are the results:

| Command | Distance | Speech Type | Amazon Alexa | Custom Hardware 1 | Custom Hardware 2 |
|---|---|---|---|---|---|
| "Open the door" | 1 m | Human | 3/3 | 3/3 | Depends on the pronunciation |
| "Open the door" | 3 m | Human | 3/3 | 3/3 | Depends on the pronunciation |
| "Open the door" | 1 m | Synthesis | 3/3 | Depends on the voice type | Depends on the voice type |
| "Open the door" | 3 m | Synthesis | 3/3 | Depends on the voice type | Depends on the voice type |
| "Give me two spanners" | 1 m | Human | 0/3 | 0/3* | 0/3 |
| "Give me two spanners" | 3 m | Human | 0/3* | 0/3* | 0/3 |
| "Give me two spanners" | 1 m | Synthesis | 3/3 | 0/3* | 0/3 |
| "Give me two spanners" | 3 m | Synthesis | 3/3 | 0/3* | 0/3 |

* While Amazon Alexa won't execute the skill because of incorrectly recognized speech, in theory, with a custom code approach, we can further analyze the text, possibly using a string distance between words to guess the actual request, for example as in the sketch below.
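
Here is a rough sketch of that idea (the misrecognized phrases are made-up examples, not captured results): the recognized text is matched against the list of known commands using a similarity ratio from the standard difflib module; a Levenshtein-distance package would work just as well.

# Sketch: guessing the intended command from misrecognized text.
import difflib

KNOWN_COMMANDS = ["open the door", "give me two spanners"]

def guess_command(recognized_text, cutoff=0.6):
    # Returns the closest known command, or None if nothing is similar enough.
    matches = difflib.get_close_matches(recognized_text.lower(), KNOWN_COMMANDS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical misrecognitions of "give me two spanners":
print(guess_command("give me to spanners"))    # -> "give me two spanners"
print(guess_command("give me two spaniards"))  # -> "give me two spanners"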

If you check the table above again, you'll see that the Google Web Speech API also had trouble understanding the request generated with Amazon Polly. So homonyms, homophones and similar linguistic phenomena can cause a lot of trouble for voice assistants (a case where AI and humans experience the same problems). A possible solution is to interpret the recognized text based on its context (as we humans routinely do) to arrive at a more correct interpretation.

Interesting facts

  1. The distance (1 m vs. 3 m) had no noticeable effect on recognition quality.
  2. Simple commands such as “Open the door” work better than complex commands with many words.
  3. Depending on the computational resources available, it may make more sense to do the actual recognition in the cloud.
  4. Standard models for offline speech recognition libraries may have poor performance and a low recognition success rate. More specific training may result in more precise recognition.
  5. Amazon Alexa recognizes speech produced by Amazon Polly with a very high success rate. This is quite logical, since Amazon Polly is most likely used internally by Amazon either for testing or for generating training data for Amazon Alexa (just a guess).

Conclusion

Thank you for reading! We hope this blog post gave you an understanding of the available options for adding voice-based controls to your applications. As you can see, creating a voice assistant is both an interesting and a challenging task.

It's very important to understand that voice-based controls have both pros and cons. Before implementing them, you need to carefully analyze the environment, the type of end users and the desired interaction workflows. Security and access control must be taken into account too.

Please share your thoughts and experience of using voice assistants and voice-based controls in business applications!