What do Espressif charge to train a new wakeword? #72

RJ · 2023-05-19T17:23:20Z

RJ
May 19, 2023

That's for the full-service "we organise it all for you, including collecting hundreds of samples" option. It doesn't sound cheap.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/wake_word_engine/ESP_Wake_Words_Customization.html

kristiankielhofner · 2023-05-19T17:37:37Z

kristiankielhofner
May 19, 2023
Maintainer

Great question and one I like to talk about A LOT.

Willow is the first open source project to deliver truly Alexa-level commercial grade wake because I have more than 20 years of experience in audio, speech, etc., and very few people understand just how challenging it is to get truly commercial grade reliable wake while minimizing false wake. There are a lot of people out there who think you can just show up, repeat a word a few times, and have reliable wake. Doesn't happen. Not to mention the other issues with audio processing on device to get clean audio in far-field and acoustically challenging environments...

The process they lay out is essentially the industry standard for wake engines. In fact, the entire stack we build on (from Espressif) including the built-in wake words has been tested and qualified by Amazon themselves to be an Alexa platform frontend. So when I say we're "Alexa grade" I mean it in a very real sense.

As you can see from the docs the standard is daunting - 20k samples, > 500 speakers (including children), professional recording, audio engineering, significant testing and validation, etc.

So yes, very expensive. We use provided wake words (Alexa, Hi ESP, etc) because they were trained with that process and are freely available with the underlying ESP-SR we leverage. We have plans to fund "Hi Willow" or something along those lines depending on input from the community.

Additionally, Willow has received a bit of commercial interest as expected and that is our exact monetization strategy - custom manufactured hardware devices with custom wake and UI, etc. See here for an example.

0 replies

dpatelli · 2023-05-31T19:48:08Z

dpatelli
May 31, 2023

Awesome project! Just what I've been looking for to get away from Amazon or Google for my voice commands for my home automation.

May I place my vote in for a single word wake word? Just plain "Willow" would be fine.

1 reply

kristiankielhofner May 31, 2023
Maintainer

Industry standard is at least three syllables to avoid false activation so "Hi Willow" or similar would be required.

MatthewK78 · 2023-06-01T20:01:41Z

MatthewK78
Jun 1, 2023

First I just want to say a big thanks for developing this project!! 😄 Willow is super cool, and I'm excited to see where it goes! I had almost given up all hope on voice assistants but this has reinvigorated my interest!

I love the name Willow – so much, in fact, that my dog shares the name! Coincidentally, one of my friends also has a dog named Willow. So using "Hi Willow" might lead to some really confused puppers around our homes. As for "Hi ESP," it's not really doing it for me, and I'd rather avoid using "Alexa" as well. (Also, could there be any potential legal issues with enabling "Alexa"?)

As a huge Star Trek fan, using "computer" as a wake word would be an awesome throwback, and it does have those three syllables we're aiming for. But I get it; "computer" might be too common in everyday chatter for practical use. Here's to hoping LLMs bring us closer to context-aware voice assistants!

A few more ideas:

"Orion" - a constellation deeply connected to stars and space, and also has a nice IoT device vibe.
"Harmony" - representing balance and unity, which is essential in controlling smart devices efficiently.
"Nebula" - signifies a sense of unity between our devices, and it's just fun to say.

0 replies

kristiankielhofner · 2023-06-02T10:32:39Z

kristiankielhofner
Jun 2, 2023
Maintainer

Thanks!

We got pretty lucky with Willow. Naming projects and companies is one of my least favorite things and Willow came from a friend of mine who was excited about Whisper but kept calling it Willow for some reason ;).

Hi ESP is actually a terrible wake word. It's very awkward, and we're learning again and again that some people can't quite nail it, largely because of having to clearly enunciate E-S-P. Right now our only other real alternative is "Alexa", which people also don't like for obvious reasons.

Our likely approach will be to select a few finalists (with my filtering of wake words that are clearly impractical) and start Kickstarter or similar campaigns for each of them with pass-through pricing of the Espressif costs. Anything that achieves the target fundraising goal will be produced. For example, "computer" does seem to be a popular one. I'd never use it, but that's not a reason not to create it (it's actually the only non-Amazon branded wake word supported with Alexa/Echo). Espressif currently provides Alexa with the esp-sr framework and I can't imagine there is any legal exposure for us leveraging it.

Your other suggestions (Orion, Harmony, Nebula) are excellent and if we did a poll I suspect at least one of them would be a finalist.

2 replies

dslugPX Jun 2, 2023

Not having to say two words is really ideal. I hate saying Alexa, but we have three boxes in the house, and the one that is most consistently easy to work with is the Alexa one. Saying HI ANYTHING changes the cadence in an awkward way.

My wife wants to say computer, for the obvious reasons, but she would also like to say "house" do XYZ. Though "house" is an impractical wake word.

But OK google sucks because of the OK (and the google, but you get it).
Hey Siri, the hey is awkward.

One word and three syllables is where it's at for sure around here though.

All of this aside. I have not seen an answer - yet - even a SWAG answer of what it actually costs.

Is it hundreds of dollars, thousands, tens of thousands?

Simply have no frame of reference for these things.

obones Jun 2, 2023

For what it's worth, Orion is only two syllables in French, just like Onion would be in English, so it wouldn't be a good wake word here.
The other two are good in that regard, but it would be important to keep in mind that a wake word needs to account for foreign-language accents. And finally, could we also have male name suggestions?

MatthewK78 · 2023-06-02T20:06:32Z

MatthewK78
Jun 2, 2023

Thanks Kristian. 😊

I agree dslugPX, one word and three syllables is definitely the way to go. 👍

0 replies

kristiankielhofner · 2023-06-04T14:21:36Z

kristiankielhofner
Jun 4, 2023
Maintainer

Three syllables minimum is a hard requirement.

In terms of other suggestions, the large commercial voice assistants have carefully selected wake words for good reason. Things like:

Proper nouns and especially names tend to have more universal pronunciation. Of course nearly anything depending on training data can be an issue but generally speaking names such as "Alexa", "Amazon", and "Google" are more uniformly pronounced (across languages, accents, etc) and thus lend themselves towards more reliable wake for more people. Willow is a tree, a gender neutral first name, and also happens to be the same word as a certain renewed large media franchise (movie and recent TV series).
The "Hi, X" wake words also tend to have significantly better false wake performance as they capture more intent, context, and speech timing/cadence.

We're here because we appreciate Willow and the fundamental merits of it vs Alexa, etc. We value flexibility, user choice, and user control. However, to reach our primary goal of being competitive with these commercial projects we should learn from some of the approaches they have taken to achieve the quality user experience (compared to existing open source voice interfaces) they provide.

There are some more-or-less hard lessons learned and fundamental rules with voice interfaces and we'd be foolish to think these rules don't also apply to us. For things as fundamental as wake word we need to be careful not to nerf the user experience significantly because we didn't consider all of these fundamental issues for additional wake words.

People have strong opinions on wake word and in an ideal world the scenario everyone seems to want of "let me use anything" is a recipe for disaster as the many failed projects attempting this route demonstrate.

In short, it's unlikely everyone is going to be completely happy with one or more selected wake words. On that, I don't really know what to tell you - for something like wake word there are many good reasons why it "is what it is". My hope is that Willow is so overwhelmingly valuable and otherwise useful that having to prefix a wake word with "Hi/Hey" or use a name/word you generally don't love isn't a deal-breaker for use of Willow.

The reality basically boils down to "custom wake words don't work, you can certainly attempt it with the other open source options but the performant, reliable, and accurate choices currently are Alexa/Google/etc or Willow".

0 replies

MatthewK78 · 2023-06-04T22:10:07Z

MatthewK78
Jun 4, 2023

I believe it's essential to find a wake word that resonates with users and avoids any potential awkwardness or annoyance, especially considering it will be used frequently every day. For example, "Okay/Hey Google" is considered cumbersome by numerous people. Using "Hi Willow" may also lead to frustration, especially for those who have personal connections to the name.

Alexa, though an efficient wake word (one word, three syllables, and easy pronunciation), has likely caused inconvenience to individuals with that name in their day-to-day lives. I personally know someone named Siri who wasn't too pleased when Apple chose that name for their voice assistant. In 2021, Willow ranked as the 39th most popular name for girls, which indicates a considerable number of potential users sharing this name.

I'd like to revisit "Computer" again. This article provides some interesting insights into the benefits and reasons for adopting it: https://www.salon.com/2017/11/26/dont-call-it-siri-why-the-wake-word-should-be-computer/

When LLMs become more efficient, we could potentially use context-aware reasoning to determine user intent and improve conversation state retention. This approach would likely require devices to continuously stream input audio to a server capable of handling the LLM. Local implementation will help address privacy concerns, in addition to the open-source nature of the project. Combining Whisper with an LLM might even eliminate the need for wake word training altogether, and present additional benefits. While this might be a project for the future, I believe it would be advantageous to have the continuous input audio stream capability already in the existing code to support such a feature.

3 replies

kristiankielhofner Jun 4, 2023
Maintainer

Willow being the 39th most popular name in a given year pales in comparison to "computer" - an extremely commonly used word (which is also language specific). With "computer" I'd personally cause false wake activation dozens of times a day - more often than my actual intended use of Willow. It's important to remember that the Star Trek universe also has Warp drive... It's a science fiction media franchise (emphasis on fiction).

That Salon article makes some good points but it was written by an attorney - not a speech researcher or even technologist. I assure you Google, etc. are well aware of complaints surrounding "Hi Google" but it has tremendous advantages. Not the least of which is that no one uses the utterance "Hi Google" in conversation, as it's nonsensical to want to say "hi" to Google outside of the voice assistant. As you note, "Alexa" and "Hi Willow" have the problem of false wake activation but hopefully we're all starting to understand all of this comes down to trade-offs and compromises. "Hi" is common and consistent across many languages and "Google", "Willow", and "Alexa" are proper names with nearly universal pronunciation across languages.

What's most interesting to me on the perceived awkwardness of "Hi Google" and similar is that it actually mirrors natural human speech - when attempting to ask a specific person to complete a task or action you get their attention first by addressing them by name: "Hey Kristian - can you help me with my wake word?". I'm also not completely understanding the passion and strong opinion people have for a specific wake word. People don't seem to be bothered in the slightest addressing other people based on their specific/chosen name.

WIS has support for LLaMA based LLMs. While they're all the rage the application here doesn't make any sense - if you're streaming audio to something with an LLM running (already a bad idea) you first need to do speech to text (as we do with everything after wake) at which point you can do fairly simple parsing (with or without NLU/NLP) to do "wake word detection". LLMs (even with every known trick in the book) also require tremendous amounts of memory and processing relative to WIS ASR today and are extremely slow compared to WIS ASR (which is already slower than our current wake detection).
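To illustrate the "fairly simple parsing" step mentioned above, here is a minimal sketch of wake detection on text that has already been through speech to text. It is not how Willow or WIS works; the wake phrase, the function names, and the tokenization rule are all assumptions made for illustration.

```python
import re

WAKE_PHRASE = "hi willow"  # hypothetical wake phrase, chosen for illustration


def normalize(text: str) -> list[str]:
    """Lowercase the transcript and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_on_wake(transcript: str, wake_phrase: str = WAKE_PHRASE):
    """Return the words after the first wake phrase, or None if it is absent."""
    words = normalize(transcript)
    wake = normalize(wake_phrase)
    for i in range(len(words) - len(wake) + 1):
        if words[i:i + len(wake)] == wake:
            return words[i + len(wake):]
    return None


if __name__ == "__main__":
    print(split_on_wake("Hi, Willow! Turn on the kitchen lights."))
    print(split_on_wake("I said hi to my dog"))
```

The parsing itself is trivial; the cost discussed in this thread is everything that has to run before it (continuous streaming and ASR) just to produce the transcript.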

Why is streaming audio a bad idea? Imagine you have five Willow devices. All five will be stomping on your available WiFi bandwidth/spectrum in addition to keeping multiple ASR threads active 24/7 on WIS. You would need to run very tight loops over relatively small audio chunks to try to capture the wake word, likely with a buffer to capture statements after wake. You'd need to also support processing incoming audio across chunks because there will always be the possibility that your wake + complete command sequence will not land in the same chunk due to timing issues. You would also incur the additional overhead of attempting to do voice activity detection to know when to end the command after the wake word. Not to mention the challenge of not having access to the audio frames at this point - so you would need to define your wake word in the unicode text output of your specific language.
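The chunk-boundary problem described above can be sketched on the text side as follows. This is a simplified illustration, not an implementation from Willow or WIS: it assumes each audio chunk has already been transcribed separately, and the class name and wake phrase are hypothetical. It carries the last few words of each chunk forward so that a wake phrase split across two chunks is still found.

```python
import re


class StreamingWakeDetector:
    """Finds a wake phrase in per-chunk transcripts, including across chunk boundaries."""

    def __init__(self, wake_phrase: str = "hi willow"):  # hypothetical phrase
        self.wake = self._tokens(wake_phrase)
        self.tail: list[str] = []  # trailing words carried over from earlier chunks

    @staticmethod
    def _tokens(text: str) -> list[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    def feed(self, chunk_text: str) -> bool:
        """Feed one chunk's transcript; return True if the wake phrase completes in it."""
        words = self.tail + self._tokens(chunk_text)
        n = len(self.wake)
        found = any(words[i:i + n] == self.wake for i in range(len(words) - n + 1))
        if found or n <= 1:
            self.tail = []
        else:
            # Keep just enough words to complete a phrase that started in this chunk.
            self.tail = words[-(n - 1):]
        return found


if __name__ == "__main__":
    detector = StreamingWakeDetector()
    print(detector.feed("so then I said hi"))         # wake phrase not complete yet
    print(detector.feed("Willow, what time is it"))   # completes across the boundary
```

Note what this sketch leaves out: a chunk boundary that falls in the middle of a spoken word will usually not transcribe cleanly on either side, and nothing here handles buffering the command that follows the wake phrase or deciding when that command ends.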

Attempting to do this in any kind of latency responsive way is very challenging/currently impossible - even on something like an RTX 3090/4090 you'd likely be looking at a wake word recognition floor in at least the hundred millisecond range. This would absolutely require GPU and you'd be burning hundreds of watts at idle just to try to keep up with X sessions from X devices.

I can understand why this approach may seem like a good/obvious way to do wake word detection. There are multiple projects attempting it. It's not going well (and never will for the foreseeable future) due to at least the issues I've mentioned and more.

nikito Jun 6, 2023

All good points. I also agree with the sentiment of Hi/Hey being natural interaction with an entity. I think "Hey" feels a bit more natural (when I think about it, usually if you want to get someone's attention you say hey, whereas hi is a greeting). I think people tend to say "Hey [person], could you [request] please?".

I can understand people's wish to use "Computer"; Most people here are probably Star Trek fans (we're all techies/sci-fi fans, one of the reasons we are here I imagine 😄) so people want that callback to that nostalgic desire to emulate the "future" Star Trek brought us to. But as mentioned from a practicality standpoint it may not be ideal (I know I say computer dozens of times a day in my house in normal conversation).

I also get people saying they want a wake word that "resonates" with them; We want our assistant to feel personal, so we want a name that has some sort of meaning to us. But unfortunately a name to one person has a different meaning/taste as opposed to another, and not everyone will ever completely agree or connect with a given name. I think the best approach is what was mentioned, where several good candidates are chosen and kickstarters are made for people to vote and contribute to making that wake word a reality. That gives the community a chance to be a part of the process and land on something that everyone can be happy with, even if it is a compromise.

I can think of a few wake word names I'd personally like as well (and I'd even want them to accept "Hey" before the name because I think I naturally say Hey to get attention and would subconsciously say it even if it isn't part of the wake word. I actually do this with Alexa very often 😆 ) but I'd also be fine with "Hey Willow". I may not have any personal connection or desire for it, but it's more or less the same as saying "Hey Google" to me, except I'm addressing an actual name instead of a company which feels better to me. 😄

Just my two cents!

kristiankielhofner Jun 7, 2023
Maintainer

This is fantastic feedback (and confirmation of our thinking) and you're touching on what we're seeing - when it comes down to it, if people had their way they want individually selected wake words for a personal/intimate experience but between the limitations of training, number of samples, and the ability for users to quickly and easily "shoot themselves in the foot" it's just not practical.

As I've noted several times before Willow has already enabled a user experience that far surpasses any other open source voice assistant approaches. We insist on a "beat commercial solutions in every possible way" level user experience. We don't want to doom our users to failure even if it's what they think they want. I know I kind of seem like a stickler on these topics but we brought this level of user experience to the open source voice assistant ecosystem expressly because I have ample experience in this field and I generally know what works and what doesn't work.

I don't and can't expect users to be cognizant of what I've learned over the decades I've spent implementing these related technologies and I feel I have a responsibility to try to communicate many of these hard learned lessons to the community.

Chrusciki · 2023-06-08T19:42:05Z

Chrusciki
Jun 8, 2023

I would love to learn the process of wake word generation for these boxes. At the moment, the fact that we are tied to "Hi ESP" and "Alexa", with paying Espressif as the only way to generate private wake words, is a show stopper for the ESP32 box. I'd rather throw Rhasspy on a device instead; I can train my wake word and point it at WIS.

I understand it's a complex process, but we need the ability to train new wake words on our own without Espressif.
Perhaps using generative TTS models to make the dataset and fine-tune with a more limited set of real voices?

Edit: do we know the cost of wake word creation?

1 reply

kristiankielhofner Jun 8, 2023
Maintainer

One of the best things about open source is the freedom and choice.

You're free to use whatever is available however you wish. We'd be really interested to hear about your experience with WIS and Rhasspy!
