Designing the Listener

The primary way to interact with Karen is through speech recognition/synthesis, so the listener is a critical component of Karen’s architecture. Converting speech to text has become fairly straightforward over the last several years. There are cloud options like Google and Amazon that send snippets of audio to the cloud for parsing and return the text they contain. Likewise, there are self-contained options that do all the capturing and processing without requiring an Internet connection. There are pros and cons to both.

Karen does not use any of the cloud options, but a choice of that magnitude deserves some explanation, so let’s walk through the logic. First, cloud-based speech processing requires an Internet connection. That on its own isn’t a deal breaker, but it definitely limits some of the applications where a voice assistant could be leveraged. Second, utilizing the cloud for speech processing has costs associated with it. Yes, you can use free trials and low-volume tiers, but at the end of the day, if you’re going to really use a voice assistant then you’re going to have a monthly charge from one of the cloud providers. Third, the cloud options all use proprietary models. While they are quite good at converting speech to text, they are also limited by the need to recoup the massive investments companies like Amazon, Google, and Apple made to produce them. This means you cannot provide a fully open solution using a cloud-based provider (at least as of this post). Lastly, sending data to the cloud has privacy implications. How do you really know what those providers are doing with the snippets of audio you send them? Are they strictly used to produce text that only you receive and then permanently deleted? Do they keep those audio snippets around to help improve their models (and if so, shouldn’t you be appropriately compensated for helping them)? It gets really messy the more you dig into that model, and I suspect the future will change some things in that space… but for now it just isn’t the right option for Karen.

The alternatives to the cloud-based options are not exactly perfect either. For local processing, Karen uses Mozilla’s DeepSpeech. It is built on a completely open source speech model, but since the work is done largely by volunteers, the accuracy is not as good as the proprietary models (e.g. DeepSpeech is roughly 80% accurate while Siri is 96%+ accurate). That being said, since it is truly open it will continue to evolve and grow over time to eventually approach the accuracy of the big companies. Also, because it is open source it has a fairly good following of developers who leverage it for different scenarios and will help you get it working if you dive into the conversation stream.

One last note: while Karen uses Mozilla’s DeepSpeech, there is nothing preventing you from modifying the listener to use any service or tool that you like. The benefit of open source is that you can make it work the way you want, so long as you’re willing to put in the effort to do so.

Mozilla DeepSpeech

Okay, now that you know why Karen is using DeepSpeech, let’s talk a little about her implementation of it.

Creating text from captured audio leverages a few components, and the process generally involves capturing snippets of audio and feeding them into a model to generate the text. There is a great sample of how to do this in the DeepSpeech documentation (look for the example method named _read_from_mic()).

The example uses another library named webrtcvad, which is designed to detect whether the audio in a snippet is speech or not. This “VAD,” or Voice Activity Detection, enables a boolean evaluation of an audio frame, which gives us enough information to determine if the subject is speaking or waiting on a response. We use this as one element in deciding whether to process the snippet. The size of the snippet and the tolerance for inaccuracy are all variables that can be tweaked, and we’ll talk about those a bit more as things progress.

Once we’ve determined the inbound audio is speech and meets our threshold for speech detection, we’ll start feeding the frames into the model until there is a break in the inbound audio. At that point we’ll ask the model to tell us what was said. Simple, right?
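The feed-then-finish step maps onto DeepSpeech’s streaming API (createStream, feedAudioContent, finishStream). Below is a sketch of that step under the assumption that you’ve installed the deepspeech package and loaded a model; transcribe_utterance is a hypothetical helper name, not Karen’s actual code.

```python
# Feed buffered 16-bit PCM frames into a DeepSpeech streaming decoder,
# then ask for the final transcript once the speaker pauses.
import numpy as np

def transcribe_utterance(model, frames):
    """Decode one utterance. `model` is a deepspeech.Model (or anything
    exposing the same streaming interface); `frames` is an iterable of
    raw 16-bit mono PCM byte strings."""
    stream = model.createStream()
    for frame in frames:
        # DeepSpeech expects a numpy array of int16 samples.
        stream.feedAudioContent(np.frombuffer(frame, dtype=np.int16))
    return stream.finishStream()
```

The actual model would be constructed with something like deepspeech.Model("path/to/model.pbmm") before calling this helper.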

There is a pretty basic assumption in all of this that I need to call out. The assumption (or constraint) is that at any moment in time Karen is either listening for speech or speaking herself, but never both. Why? If Karen tries to do both, we risk a loop of her hearing her own speech. The human brain does this all the time, ignoring its own output while still listening for input… which is how we get interrupted and can interrupt others. For Karen, once she starts speaking we will not allow interruption (wouldn’t it be great if that could be true for all of us?). At some point we’ll work on improving this, but it’ll have to do for now.

So that leaves us with a processing structure like this…

  1. Listen
  2. Detect Speech
  3. Loop until Threshold of Speech Fails
    • Capture Audio Snippet
    • Feed to Model
  4. Convert to Text & Generate Response
  5. Speak Response
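The steps above can be sketched as a single loop. Here, read_frame, is_speech, and decode are injected stand-ins for the microphone, the VAD check, and the model respectively; these names and the max_silence threshold are assumptions for illustration, not Karen’s actual implementation.

```python
# Minimal sketch of the listen -> detect -> decode loop.

def listen_once(read_frame, is_speech, decode, max_silence=3):
    """Collect one utterance: buffer frames while speech is detected,
    stop after `max_silence` consecutive non-speech frames, then hand
    the buffered frames to the model for transcription."""
    frames = []
    silence = 0
    while True:
        frame = read_frame()
        if frame is None:       # microphone closed / no more audio
            break
        if is_speech(frame):
            frames.append(frame)
            silence = 0         # reset the break counter
        elif frames:            # only count silence once speech has started
            silence += 1
            if silence >= max_silence:
                break           # the speaker paused; time to transcribe
    return decode(frames) if frames else ""
```

The response-generation and speaking steps would then run on the returned text before the loop starts listening again.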

Many of the existing personal assistants utilize a wake word (or hot word) that tells the system that the audio it is collecting should be converted. All of the personal assistants capture audio all the time… and I repeat… capture 100% of the audio around them. The question is when they send that audio to the cloud for processing, and that’s where the wake word comes in. Wake word processing uses a separate process that listens to the audio stream using local-only logic, looking for a very specific block of audio. This is why, when you set up Siri’s “Hey Siri,” you have to repeat yourself several times: it’s actually building a model in the background of that one phrase, which it will then use in a very power-efficient manner to identify triggers for standard processing. These routines are also proprietary, and the open source options are limited and frustrating. Mycroft’s Precise is probably the best available, but as of now Karen does not use a wake word because 100% of her audio processing is done locally. If that weren’t the case, we’d need some sort of wake word just to limit the amount of data you’d be streaming to the cloud providers. Anyway, for now, Karen is like a normal person and listens to everything… which is part of the goal, as she should eventually be enhanced to figure out when she is being spoken to instead of relying on you calling her name each and every time you want something.