Apr 08, 2020
While voice assistants are very popular in consumer applications, the enterprise market is only starting to explore contactless interaction with end users through voice-based technologies.
In this blog post, we are going to show the difference between simply using Amazon Alexa and implementing a custom voice assistant with the AWS Alexa SDK and other tools.
A voice assistant is a piece of software that uses commodity or specialized hardware, speech recognition, natural language processing, and speech synthesis to replace traditional controls such as a mouse, keyboard, or touchscreen with voice commands. Right now the consumer market for voice assistants is dominated by three tech giants: Amazon, Google, and Apple. Each of these companies offers easy-to-use devices that let end users control multimedia devices, listen to the latest news, and even talk to their smart home devices.
Businesses, on the other hand, are only beginning to adopt voice-based controls. There are many reasons for that, including security concerns, mass-deployment issues, and the limitations of speech recognition in the presence of noise or other sound distortions, such as echo.
We believe this is a promising technology, especially when used in combination with other human-computer interaction methods. In this blog post we are going to explore the capabilities of Amazon Alexa, the AWS Alexa SDK, and other libraries.
We'll go through two possible approaches to implementing voice control in business applications. First, we can use Amazon Alexa or Alexa for Business as a ready-to-use hardware/software solution. Alternatively, we can use the AWS Alexa SDK or another SDK to develop a custom voice assistant tailored to our needs. Both approaches have their own pros and cons.
Alexa for Business is a deployment option of Alexa for organizations, focused on mass deployment to meeting rooms and broader use of Alexa around the office.
In this blog post we are creating a custom voice assistant based on the ReSpeaker 4-Mic Array for Raspberry Pi. Our goal is to determine how to add voice assistant capabilities to existing hardware with minor modifications (such as adding a microphone to the existing setup).
The ReSpeaker 4-Mic Array for Raspberry Pi is a four-microphone expansion board for the Raspberry Pi designed for AI and voice applications. It lets us build a more powerful and flexible voice product that integrates Amazon Alexa Voice Service, Google Assistant, and so on.
First of all, we want to use our own wake words, mainly for user experience and marketing purposes. Additionally, we need more control over the actions available to the user and want to restrict the usage of irrelevant or unneeded skills.
We'll be working with two different libraries. The first is the SpeechRecognition Python library, a flexible wrapper around multiple APIs. In the example below we will use the Google Web Speech API. A default API key is hard-coded into the SpeechRecognition library and is used in the example, so no additional configuration is needed. We will also check out the Vosk library, which has offline capabilities.
Before using SpeechRecognition with a real microphone, we need to install some dependencies. To access your microphone from SpeechRecognition, you'll have to install the PyAudio package. The process for installing PyAudio depends on your operating system.
SpeechRecognition supports several engines and APIs, both online and offline, and works with Python 2.7 and 3.3+. With the help of this library, we can show the recognized text on a display, if the Raspberry Pi has one attached.
Person: Open the door.
Assistant (if the recognition confidence level is too low): You said: open the door?
Person: Yes. / Sure.
Assistant (processes the user's request): OK, I am opening the door.
Here is a related code example:
In this example, we capture input from the microphone using the listen() method of the Recognizer class. This method takes an audio source as its first argument and records input from that source until a pause is detected. The resulting fragment is then processed by the underlying speech recognition library or API.
This is a very simple example of how we can implement our own voice assistant. To improve our solution and let the tool answer with its own voice, we can use a speech synthesis library or API, for example the Amazon Polly service.
It's very important to remember that speech recognition is only one half of a voice assistant. Once the request is processed, proper voice feedback is needed.
Vosk-API provides language bindings for Vosk, a speech recognition toolkit built on top of Kaldi, from various languages and on various platforms; we will need its Python wrapper to test how the recognition works. Kaldi is a speech recognition toolkit written in C++.
Note: After cloning the code from GitHub and performing all the required steps, the code may fail with the following error: "Input overflowed". This can happen due to the limited computational power of the device running the code. The quick-and-dirty workaround is to disable buffer overflow exceptions by passing exception_on_overflow = False as the second parameter to the stream.read() method. Here is the code example:
A big advantage of this library is that it works fully offline. However, recognition quality depends heavily on pronunciation (to get a decent recognition rate, I had to pronounce words very clearly), and recognition takes some time (more than 2.5 seconds for simple phrases like “open the door”). Both the recognition success rate and the speed depend on the selected model, the hardware used, and other factors such as background noise and echo.
Here is a photo of my setup. You can see the ReSpeaker on the left and Amazon Alexa on the right.
We’ve tested three voice assistants: Amazon Alexa, the ReSpeaker with the SpeechRecognition library, and the ReSpeaker with the Vosk-API library. We tried to make these tests as unbiased as possible, so we used different voice sources: Amazon Polly and a human voice. We also ran the tests at two typical distances: 1m and 3m. Here are the results:
| Command | Distance | Speech Type | Amazon Alexa | Custom Hardware 1 | Custom Hardware 2 |
|---|---|---|---|---|---|
| "Open the door" | 1m | Human | 3/3 | 3/3 | Depends on the pronunciation |
| "Open the door" | 3m | Human | 3/3 | 3/3 | Depends on the pronunciation |
| "Open the door" | 1m | Synthesis | 3/3 | Depends on the voice type | Depends on the voice type |
| "Open the door" | 3m | Synthesis | 3/3 | Depends on the voice type | Depends on the voice type |
| "Give me two spanners" | 1m | Human | 0/3 | 0/3* | 0/3 |
| "Give me two spanners" | 3m | Human | 0/3* | 0/3* | 0/3 |
| "Give me two spanners" | 1m | Synthesis | 3/3 | 0/3* | 0/3 |
| "Give me two spanners" | 3m | Synthesis | 3/3 | 0/3* | 0/3 |
\* While Amazon Alexa won't execute the skill because of the incorrectly recognized speech, with a custom-code approach we can, in theory, analyze the text further, for example using an edit distance between words to guess the actual request. For example, look at the results below:
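A minimal sketch of this idea using only the Python standard library; the command list and the similarity cutoff are illustrative assumptions:

```python
import difflib

# Commands our hypothetical assistant supports (illustrative list)
KNOWN_COMMANDS = ["open the door", "close the door", "give me two spanners"]

def guess_command(recognized_text, commands=KNOWN_COMMANDS, cutoff=0.6):
    """Map possibly misrecognized text to the closest known command.

    Uses difflib's similarity matching; returns None when nothing is
    similar enough (so we can fall back to asking the user).
    """
    matches = difflib.get_close_matches(recognized_text, commands,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A homophone-style misrecognition ("to" instead of "two") still maps
# to the intended command:
print(guess_command("give me to spanners"))  # → give me two spanners
print(guess_command("open door"))            # → open the door
print(guess_command("play some music"))      # → None (unknown request)
```

For a production system, a phonetic distance or a language model would handle homophones better than plain string similarity, but the fallback logic would look the same.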
If you check the table above again, you'll see that the Google Speech Recognition API also had trouble understanding the request generated with Amazon Polly. Homonyms, homophones, and similar linguistic phenomena can cause a lot of trouble for voice assistants (this is a case where AI and humans experience the same problems). A possible solution is to process the data based on context (as we humans often do) for a more accurate interpretation.
Thank you for reading! We hope this blog post gave you an understanding of the available options for adding voice-based controls to your applications. As you can see, creating voice assistants is both an interesting and a challenging task.
It's very important to understand that voice-based controls have both pros and cons. Before implementing them, you need to carefully analyze the environment, the types of end users, and the desired interaction workflows. Security and access control must be taken into account too.
Please share your thoughts and experience of using voice assistants and voice-based controls in business applications!