
Web Speech API #170

Open
adamopenweb opened this issue Jun 21, 2019 · 44 comments
Labels
under review venue: W3C CG Specifications in W3C Community Groups (e.g., WICG, Privacy CG)


@adamopenweb

Request for Mozilla Position on an Emerging Web Specification

Other information

  • Is this spec good for the web?
  • Is it the right thing to put it in the browser?
@martinthomson
Member

At a high level, the ability to access speech recognition and synthesis capabilities on the web is reasonable. Providing access to speech can do a lot to improve agency and accessibility on the web, and I don't see any significant problems with providing platform features that improve that. Many of these capabilities are already available through existing APIs, but the burden of providing the recognition/synthesis parts is borne by sites. A platform-provided capability would do a lot to improve access for all sites, not just those with the resources to build those systems.

However, I think that the form of the proposed API is not entirely consistent with our principles on how we might build those features.

These seem to be driven by the Android speech APIs without a lot of consideration for the APIs already present on the web. This could and probably should operate on media streams, both for input to recognition and for output of synthesis. In particular, this should use the getUserMedia API for access to the microphone and the related permissions.

For the text, it would be good to understand how this might be mapped to WebVTT and its APIs and other APIs that operate on audio.

I would also like to gather the views of those who manage editor and keyboard input. I think that the recognition parts would benefit a lot from their views and experience. For instance, it would be good to understand how much - if at all - access to speech recognition might best be integrated with text input in <input type=text> and rich editors. I've asked Masayuki Nakano to comment here.

In thinking through the security of this, there is one potential trap. If this is built as it is defined, then speech recognition might be tuned to an individual. However, if a site can provide arbitrary audio input to the engine as I suggest, then any tuning might be exploited by a site to learn about individuals. Minimally, that is a fingerprinting exposure. But it might also go beyond that to learn about the speech (including characteristics of voice and language, like gender) of the person using the computer, without requiring any permissions. That should be easy enough to defend against, but it's a consideration that could easily be overlooked.

@annevk annevk added the venue: W3C Specifications in W3C Working Groups label Jun 25, 2019
@jan-ivar
Member

jan-ivar commented Jun 26, 2019

This could and probably should operate on media streams, both for input to recognition and for output of synthesis

Yes—maybe even on audio MediaStreamTracks directly (unless there are plans for lipreading?)

Relying on getUserMedia simplifies the speech spec's permission story and gives JS more control, letting it obtain the microphone independently of this API and call the speech API multiple times as needed. This matters, since not all browsers implicitly persist microphone permission after a single use the way Chrome does.

For example, in Firefox, the existing API ends up requesting permission from the end-user each time the speech API is called by default. This is undesirable, but the current API invites it, and leaves us and browsers like us with no way to fix this problem.

The only workaround at the moment relies on web developers knowing that current implementations AFAIK call getUserMedia under the hood: calling getUserMedia({audio:true}) first should cause subsequent mic requests to bypass any permission prompt for as long as the page holds on to the resulting tracks as dummies. I don't have high confidence that web developers will do this, since "it just works in Chrome".
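To illustrate, here is a minimal sketch of that workaround, assuming (as described above) that the recognition engine opens the microphone via getUserMedia internally; the dummy track is held only to keep the permission grant alive:

const recognition = new SpeechRecognition();
// Request the microphone up front so the permission prompt happens here...
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
// ...and keep the resulting track alive as a dummy, so the engine's own
// internal microphone request can reuse the already-granted permission.
const dummyTrack = stream.getAudioTracks()[0];
recognition.onresult = event => console.log(event.results[0][0].transcript);
recognition.start(); // should not re-prompt while dummyTrack stays live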

A quick band-aid API fix would perhaps be for start() to take a track argument:

const recognition = new SpeechRecognition();
// Permission is requested here, by getUserMedia, rather than by the speech engine.
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
// Proposed: start() accepts the audio track to recognize.
recognition.start(stream.getAudioTracks()[0]);

This quick change might be worth fighting for before considering releasing this API?

On privacy, to me, the risks seem to all stem primarily from microphone permission, since, as mentioned, JS can already do speech recognition in software. I agree the risks are worth mentioning however, maybe even in the spec.

@kdavis-mozilla

In thinking through the security of this, there is one potential trap. If this is built as it is defined, then speech recognition might be tuned to an individual.

By whom?

But it might also go beyond that to learn about the speech (including characteristics of voice and language, like gender) of the person using the computer...

If an attacker had a speech corpus tagged with gender, a much easier attack would be to simply train a machine learning model to recognize gender, then use the normal microphone access to send audio to their model.

The attack suggested is far more complicated. The attacker would have to pass their corpus to the STT engine, record the STT engine's responses, and train another machine learning model on the mistakes made by the STT engine that were correlated with gender. The attack would then be to send speech to the STT engine, get the transcript back, send the transcript to their model, and have their model guess gender. So using the STT engine to determine gender would add an unnecessary call to the STT engine and decrease the accuracy of the gender determination.

So in other words there are better ways of determining gender that are already present in the browser.

@kdavis-mozilla

On privacy, to me, the risks seem to all stem primarily from microphone permission, since, as mentioned, JS can already do speech recognition in software.

This is, with more details, the point I wanted to emphasize above.

@martinthomson
Member

then speech recognition might be tuned to an individual.
By whom?

If the browser runs the model, it makes sense for the model to be tuned for the person who uses the browser.

So in other words there are better ways of determining gender that are already present in the browser.

I am talking about the scenario where the site does not have access to the microphone. If microphone access is granted, then we have to assume that the site will learn anything exposed by that, but if an arbitrary stream can be fed into the recognition engine, then that can be used to probe the state of the engine. If that state includes information that we might prefer to consider private (for which I only used gender as an example), then that would not be good.

The attack suggested is far more complicated.

I don't mean to imply that this is easy, but when talking about attacks, we have to consider possibilities regardless of their complexity.

A low complexity attack would just use the state of the voice recognition engine for fingerprinting. Sites have already proven a willingness to go to extreme lengths to fingerprint people and this doesn't seem that complicated.

@kdavis-mozilla

If the browser runs the model, it makes sense for the model to be tuned for the person who uses the browser.

A few points on this..

  1. Ignoring the WebSpeech API for a moment: if one uses normal mic access, cookies, and a server-based STT engine, one can tune the model to the person who uses the browser. So this means of attack is already possible without the WebSpeech API.
  2. The initial integration will be server-based, so the in-browser tuning you are referring to does not apply.
  3. A future integration will run in the browser. However, it cannot be tuned for a particular person.

If microphone access is granted, then we have to assume that the site will learn anything exposed by that, but if an arbitrary stream can be fed into the recognition engine, then that can be used to probe the state of the engine. If that state includes information that we might prefer to consider private (for which I only used gender as an example), then that would not be good.

Probing the state of an STT engine in the way you suggest, using the mic, is possible without the WebSpeech API by using standard mic access and any number of existing STT services.

My point is that we should be considering the new lines of attack that are opened as a result of the WebSpeech API, not attacks that are already possible without it.

I don't mean to imply that this is easy, but when talking about attacks, we have to consider possibilities regardless of their complexity.

Fair enough. But my point is that the attack you mention is already possible without the WebSpeech API and easier to do without using the WebSpeech API.

@martinthomson
Member

I think that I wasn't clear enough. My point is that this information might be exposed without a permissions prompt. That is, if the model might be used on arbitrary audio, then asking for permission for a microphone is not necessary. I don't see any reason, or even a plausible excuse, for asking for permission to use this API.

The risk profile for features that aren't permission-gated is very different.

@kdavis-mozilla

My point is that this information might be exposed without a permissions prompt.

OK. Got it!

If the Web Speech API is implemented as per the spec (see the Security and privacy considerations section, excerpted below),

[Screenshot of the spec's Security and privacy considerations section.]

then only an incorrect implementation would not prompt the user.

For example, "Accepting a permission prompt shown as the result of a call to SpeechRecognition.start" would always prompt the user regardless of the audio source.

@martinthomson
Member

That very much assumes the current design. I should have said that the questions regarding permission-free access only apply if you take the suggestion to better integrate with other existing APIs. If you build the API as specified (which would not be good for the platform as a whole, in my opinion), then of course you would want to have some sort of permission scheme with notice about recording.

@kdavis-mozilla

So you suggest that to better integrate with existing APIs we should remove permission prompts? This cure seems worse than the disease. What in the existing standard is so problematic that it is worse than removing permission prompts?

@jan-ivar
Member

The current API is tightly coupled with microphone permission for no clear reason, causing a permission prompt in Firefox every time the speech API is called. It's biased toward Chrome's permission model. In particular:

  • For users with more than one microphone, once they persist permission, there is no API for selecting which microphone to use.
  • There's no way to recognize speech from audio sources other than the local mic, e.g. from a WebRTC call.
  • There's no way to pre-process audio, e.g. with web audio, before feeding it to the speech API (see the sketch after this list).
  • There's no API for controlling noise suppression, gain control or echo cancellation.
  • There's no way to direct output to something other than local speakers, e.g. send it over a WebRTC call.
  • Interactions with getUserMedia permissions are poorly specified.
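As a rough sketch of the pre-processing point, assuming the hypothetical track argument to start() proposed earlier, a page could run the microphone through Web Audio before recognition, and would control noise suppression and echo cancellation through getUserMedia constraints rather than through the speech API:

// Acquire the microphone; constraints such as noiseSuppression belong to
// getUserMedia, not to the speech API.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {noiseSuppression: false, echoCancellation: false}
});
const ctx = new AudioContext();
const source = ctx.createMediaStreamSource(stream);
// Example processing step: a simple low-pass filter.
const filter = ctx.createBiquadFilter();
filter.type = "lowpass";
filter.frequency.value = 3000;
const destination = ctx.createMediaStreamDestination();
source.connect(filter).connect(destination);
const recognition = new SpeechRecognition();
// Hypothetical: feed the processed track rather than the raw microphone.
recognition.start(destination.stream.getAudioTracks()[0]);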

@marcoscaceres
Contributor

marcoscaceres commented Jul 10, 2019

Sounds like there is good opportunity for the WebRTC folks and Web Speech folks to work together to share expertise on getting better permissions integration and overall API cohesion.

Conceptually, enabling "Web Speech" does seem like something we are pursuing at Mozilla/Firefox - so this might already be in the "worth prototyping" realm - but with the proviso that the API may need some refinements. Would that be a fair assessment?

It definitely sounds like there are some real opportunities to collaborate.

@kdavis-mozilla

@jan-ivar Most of your comments seem like, more-or-less, reasonable things we could change
with the API in the future. However, right now I don't see any lever we could use to change the
official API.

We are not in production with a standard or non-standard version of the API, whereas Google,
Apple... are already in production. So, I see no reason they should listen to us and change their
APIs in response to our requests, as it will only introduce breakage in their systems and win
them little to nothing.

The alternative, which seems to be what you are suggesting, is simply to ignore the official API,
to a much larger extent than Google, Apple... or anyone else has, and create a Mozilla API. This
might work if we were a market leader, but we are not. So it seems as if it would fragment things
worse than they already are fragmented, which to me seems like the wrong direction to go in.

I'd be interested in understanding how and when, or if, you want to get your suggestions into
the official API.

@marcoscaceres
Contributor

I think what we are proposing is a set of changes to improve how the API works. And that could also help with privacy, security, and overall web-platform integration. Yes, that will likely come with some breaking changes at the cost of interop. And yes, we will need to convince our friends at Google and Apple to change their implementations - and with very good reason. But the spec is by no means "official"... it's a Community Group Draft and thus subject to change if we want it to be standardized officially.

Web Standards are all about making these compromises.

This might work if we were a market leader, but we are not.

Please don't undersell Mozilla's position: We've always been constructive contributors to specifications, and our suggestions have always been welcomed and taken on board by the web community. We may not have 60% market share, but we do have a lot of users, and our opinions matter just as much as anyone else's in the community (irrespective of market share).

@martinthomson
Member

I'm happy to say "worth prototyping" as long as the text makes it clear that we think the API is far from ideal in its current form.

@smaug----
Collaborator

smaug---- commented Jul 10, 2019

I could note that the API is basically from 2011, well before getUserMedia or anything like that.
It is a modified version of one of the ideas HTML Speech Incubator Group had.
And it wasn't really modeled on top of Android APIs or anything like that.

And Chrome used to have something like <input type=text x-webkit-speech>, but that is a very inflexible approach, as was shown, for example, by the work Nokia did ~15 years ago with X+V and such.
(One of the ideas the group had was a <reco> element, and that is relatively close to SALT https://en.wikipedia.org/wiki/Speech_Application_Language_Tags)

@jan-ivar
Member

The alternative, which seems to be what you are suggesting, is simply to ignore the official API,
to a much larger extent than Google, Apple...

AFAIK only Chrome implements SpeechRecognition, which doesn't qualify as "official". Apple's microphone permission model is closer to ours, so I wouldn't assume their position here.

All but two of the concerns I raised seem solved by adding an optional track argument to start(). But doing so has the privacy concerns @martinthomson raised. I'm happy to discuss those further here.

A more aggressive attempt to move the needle on other implementations might include lobbying for this track argument to be required rather than optional, and only implement that in Firefox.

If we don't do that—meaning: if we don't bend web developers to use a different API surface where they obtain microphone permission separately using getUserMedia—then Firefox users (and I suspect Safari users, @youennf any thoughts here?) will likely get a permission prompt every time SpeechRecognition.start() is called, on most sites by default. That seems like an inferior experience to me for web users who resist persisting permission for privacy reasons.

A counterpoint would be if having access only to speech results and not the audio itself constitutes a lower privacy risk profile, but I don't think it does.

@othermaciej

To my knowledge, Safari does not ship the SpeechRecognition part of this API. I can help summon the right people to comment on the WebKit and/or Apple stance on this.

This thread is somewhat long and I'm not totally sure what the key issues here are. I see permissions model (where this could result in over-prompting) and integration with MediaStreams (not entirely unrelated). Anything else?

@marcoscaceres
Contributor

@othermaciej, other things:

  • it introduces new *List types, where either existing types or WebIDL iterables would suffice.
  • it seems to introduce its own error handling and codes - which is something we generally try to avoid.
  • the mutable serviceURI attribute seems like it could be easily abused.
  • does SpeechRecognition really require a constructor? Are there use cases for having multiple concurrent SpeechRecognition instances processing speech at once?
  • It looks like it could benefit from being a streams-based API.
  • other privacy concerns around querying the OS for things it shouldn't be querying.

@smaug----
Collaborator

* does SpeechRecognition really require a constructor? Are there use cases for having multiple concurrent SpeechRecognition instances processing speech at once?

Yes. You can, for example, have page-level recognition and then another instance which is active only when some particular element has focus (see the sketch below). Though, that may depend a bit on whether one uses specific grammars or relies on grammar-free recognition.
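A minimal sketch of that scenario, assuming the current API shape (the element and handler names are hypothetical):

// Page-level recognition that runs continuously.
const pageRecognition = new SpeechRecognition();
pageRecognition.continuous = true;
pageRecognition.onresult = event => handlePageCommand(event); // hypothetical handler
pageRecognition.start();

// A second instance, active only while a particular field has focus.
const fieldRecognition = new SpeechRecognition();
const field = document.querySelector("#search"); // hypothetical element
field.addEventListener("focus", () => fieldRecognition.start());
field.addEventListener("blur", () => fieldRecognition.stop());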

And indeed the API could use modernization, but as I said, it is a very old API. The draft is mostly from 2012, https://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0001.html https://w3c.github.io/speech-api/speechapi.html so many of the recommendations for how to create APIs have changed since then.

@marcoscaceres
Contributor

And indeed the API could use modernization, but as I said, it is a very old API.

Sure, but the API itself has evidently been actively updated throughout much of 2018:
https://github.com/w3c/speech-api/commits/master

Taking the first cut of a 2012 API and expecting it not to evolve is inexcusable - especially given that it didn't undergo formal standardization and has evidently had limited privacy review. We are constantly making APIs better on the platform. So it might be that with a few tweaks, backwards compatibility can be retained while also improving the overall shape and ergonomics of the API. The spec also lacks most implementation details: it only describes intended behavior, not how to actually implement that behavior; there are no algorithms, it only loosely defines event ordering, and so on.

Developer ergonomics and lack of implementation details aside, the emma attribute seems highly problematic. It wants to expose a live XML Document using this extremely complex XML format: https://www.w3.org/TR/emma/ - I don't expect any browser vendor to seriously implement that. That seems harmful.

I'm still wondering what the speech recognition API is trying to achieve. The use case for filling form fields seems somewhat unnecessary, given that OSs already provide this capability (at least, iOS's keyboard lets me dictate into text fields... and I can do the same in macOS). And I imagine everyone involved with the Speech API has seen the upcoming Voice Control on macOS:

Please watch https://youtu.be/v72nu602WXU if you haven't.

Is there a risk that this API could conflict with technologies like Voice Control? That could be quite harmful to accessibility, even if the web application is well intentioned.

From a privacy perspective, the SpeechSynthesis API is a fingerprinting bonanza: the API has the ability to detect when new voices are installed on a user's system (the voiceschanged event) - that's bad. But we then have things like voiceURI - and the ability to enumerate installed voices on the OS, which is akin to enumerating fonts or plugins. The current API is thus quite harmful from a privacy perspective - so removing voiceURI and voiceschanged, and allowing enumeration of only a small/standard subset of voices, is an absolute must.
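As a minimal illustration of that fingerprinting surface, any page can enumerate the installed voices, and observe changes to them, without any permission prompt:

function listVoices() {
  const voices = speechSynthesis.getVoices();
  // name, lang and voiceURI together can vary per machine and OS configuration.
  console.log(voices.map(v => `${v.name} (${v.lang}) ${v.voiceURI}`));
}
// Voices may load asynchronously; voiceschanged also fires when the user
// installs or removes a voice, which is itself observable by the page.
speechSynthesis.addEventListener("voiceschanged", listVoices);
listVoices();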

With those things above fixed, SpeechSynthesis could be quite a nice addition to the platform - maybe it should be split out into a separate spec?

The SpeechRecognition part still needs a lot of work, IMHO. In its current state, and given all the things above, I'd be inclined to position this API as "harmful"... but with the potential to be improved and become something quite useful.

@smaug----
Collaborator

One of the key use cases for the Web Speech API is multimodal applications.
The classic example is a map application where one uses a pointer to circle an area while saying "show the restaurants in this area" (see the sketch below).

https://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/#use-cases has more
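A rough sketch of that multimodal pattern, assuming the current SpeechRecognition shape (the map element and helper functions are hypothetical):

const recognition = new SpeechRecognition();
recognition.continuous = true;

let selectedArea = null;
// Pointer modality: the user circles an area on the map.
mapElement.addEventListener("pointerup", event => {
  selectedArea = getCircledArea(event); // hypothetical helper
});

// Speech modality: the user says "show the restaurants in this area".
recognition.onresult = event => {
  const last = event.results[event.results.length - 1];
  if (/restaurants in this area/i.test(last[0].transcript) && selectedArea) {
    showRestaurants(selectedArea); // hypothetical helper
  }
};
recognition.start();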

I'm having a hard time seeing how SpeechRecognition is harmful. Yes, the permission handling may need tweaking and the API could use modernization, but is that enough to say it is harmful?
Yes, the spec could definitely use some rewriting and clarification, and possibly removal of unneeded features, etc.

There have been multiple proposals over the years for handling speech on the web:
at least X+V, SALT, CSS-MMI, and most recently the Web Speech API.
Based on my experience with all of them, the last one comes closest to being easy to use.

@youennf

youennf commented Jul 13, 2019

Given the shape and state of implementation of the current chrome-only API, it also makes sense to me to push for a more modern version.
Existing APIs, the current WebSpeech API and SFSpeechRecognizer for instance, would certainly help a lot in the design.

SFSpeechRecognizer supports both microphone content and recorded content, for instance.
Use cases such as transcribing your podcast or getting subtitles from a WebRTC P2P call might require shaping the API (both input and output) differently if considered in scope.

The use of SFSpeechRecognizer is also guarded in native apps by a one-time user prompt for both microphone content and recorded content cases.

@wisniewskit

Not to rock the boat too much, but it's worth noting that this conversation is unfortunately rather late to the game, given that websites are already using this feature for production apps (a few of them can be seen on the relevant Bugzilla bug as see-also links to webcompat.com). It may be a long road to find all of them and convince them to care if we cannot get buy-in from Chromium to deprecate and remove the feature in a timely manner.

@marcoscaceres
Contributor

@wisniewskit, so yes, sites are using it - but usage is extremely low, and interoperability is even lower. We are already collaborating with Google on removing a bunch of dead/silly stuff from the spec... we may be able to refactor it into something useful/usable 🤞.

@andrenatal

andrenatal commented Aug 10, 2019

Usage is low because (among other reasons), even though Chrome has had this API enabled since circa 2011 (iirc), the web was never strategically Google's top priority for spreading the adoption of such technology; Android was, so they could lock developers and users into their ecosystem, and the same applies to other browser vendors. Even these days, now that they have started to support voice on desktops (Pixelbooks), they ship Google Assistant, like Siri on macOS, for example.

Having said that, and having been in this space for so long, I don't see any company besides Mozilla able (and interested) to advance the web by bringing speech to empower developers, since the others tend to promote their own proprietary platforms. If we don't do it, no one will, and the web will continue being left behind (although it seems Edge started to support it in 76, https://caniuse.com/#search=speech, probably due to their switch to Blink).

@wisniewskit

For the record, I of course recognize that Mozilla should try to usher in a better API, and that usage of the existing one is low. But frankly speaking, neither of those things matters unless:

  • we create a demonstrably better new API, before usage of the old one gets out of hand
  • we can convince other browsers to switch over to that new API
  • we can convince web apps/sites to switch over to that new API

Over the past year I have only seen the reported webcompat.com issues involving speech APIs increasing, not decreasing. I cannot see that growth slowing down now that Edge also supports the API (especially given that both Google and Microsoft have already been using it on their sites and services for quite some time).

As such I feel we need to prioritize at least getting strong public buy-in from major existing users of such APIs relatively soon -- presumably the likes of Google, Microsoft and DuoLingo -- or we'll just end up losing this battle or having to support two APIs. Ideally, we'll find that there is already a desire to "do better". But I do feel we have to err on the side of caution here, given how the likes of CSS scrollbar APIs and touch/pointer events worked out.

@marcoscaceres
Contributor

marcoscaceres commented Aug 10, 2019

@wisniewskit, I think I’ve miscommunicated the issues with the API - and for that I’m sorry. The problem is not that the whole API is broken and we should do away with it: On the contrary, people here are arguing that we should fix what we can, because in its current state it’s harmful.

I’m not sure how versed you are in reading specs, but if you take a look at the actual spec you will see that there are parts of the API that are either impossible to implement in an interoperable manner or for which the spec doesn’t say what to do. To be blunt, the spec hardly qualifies as a spec at all... it’s more of a wish list thinly disguised as a technical specification, only because it uses a W3C stylesheet: There are no algorithms. There is basically zero specified error handling. The eventing model is a total mystery. And much of it is just hand-waving that magical things will happen and speech will somehow be generated/recognized (see the grammars section of the spec for a good hearty chuckle).

If you take a look at the spec repository, you will see that we’ve been actively working together with folks from Google to remove some of the things that are in the spec but that Google didn’t actually implement.

Please also note that Edge shipping it doesn’t mean anything at this point. Edge is just taking whatever is in Blink, so the changes we are making to the spec will just automatically appear there.

With that said, yes, Mozilla is totally the only player in this space that can open this up and make things happen (as @andrenatal rightfully points out). However, to do that, we need a spec that’s actually implementable and can be made interoperable, has a proper permissions model, and doesn’t add a ton of fingerprinting entropy.

So while there will be “compat bugs”, without a proper spec, we will be forced to go read Chrome’s source code to actually see what we are supposed to be implementing. The spec, in its current state, ain’t gonna help us.

Hopefully that clarifies things.

@marcoscaceres
Contributor

A few of us are working to address the concerns that were identified in this thread above. As Emerging Technologies would like to ship (see intent to ship), we should mark it as "important".

Steps forward here are to update the spec to address the things we discussed above.

@marcoscaceres
Contributor

Based on discussions and given that it's only shipping in Nightly, the above suggested position should actually be "worth prototyping".

@guest271314

Relevant to "security" concerns and Chromium's implementation of SpeechRecognition, for several years users' voices were recorded and sent to a remote web service without any prompt or permission (https://bugs.chromium.org/p/chromium/issues/detail?id=816095; WICG/speech-api#56). It is still not clear if the recorded voices were/are stored forever and further used for proprietary purposes undisclosed to the user.

Re TTS, it is technically already possible, with sufficient effort, to capture the audio output of speak() to a MediaStreamTrack in Firefox (w3c/mediacapture-main#650) and Chromium (guest271314/SpeechSynthesisRecorder#14 (comment)), though that behaviour is not specified; thus an author of the Media Capture and Streams specification can simply state that clarifying such behaviour is not within the restrictive bounds of the existing specification.

The technology to carry out the existing Web Speech API could certainly be updated to expose the direct socket connection to speech-dispatcher (https://stackoverflow.com/questions/48219981/how-to-programmatically-send-a-unix-socket-command-to-a-system-server-autospawne). For example, in some cases the socket connection can remain open even after the browser is closed. And to make it clear that getUserMedia() can both accept input for SpeechRecognition and output audio that the user has control over via MediaStreamTrack for speechSynthesis.speak().

A start on SSML parsing (WICG/speech-api#10) for TTS: https://github.com/guest271314/SpeechSynthesisSSMLParser. Note, due to lack of development and deployment consensus re SSML on the web platform, the last time I checked, *mazon "Alexa" and "Polly", *BM "Watson Bluemix", *oogle "Actions", etc. each parsed <s> and <p> elements differently (https://github.com/guest271314/SpeechSynthesisSSMLParser/blob/master/SpeechSynthesisSSMLParserTest.html#L258), inserting a pause or not, arbitrarily. Which browser that implements SSML parsing can tell them they are doing it right or wrong? AFAICT, no browser ships with SSML parsing.

What is remarkable is that when the topic of TTS/STT is raised by users who have been attempting to implement functionality or create workarounds, the invariable response is a question about use cases. All that is required to answer the question of use cases is the amount of resources private concerns have invested into the technology over the past 10 years. Some very basic cases: https://lists.w3.org/Archives/Public/public-speech-api/2017Jul/0004.html. Other use cases involve creating an audio book; an individual who had a tooth pulled but wants to send a voice message; developing FOSS STT/TTS locally (espeak-ng/espeak-ng#669). Consider this well when asking about use cases and compelling interest: there are ads for *mazon's "Alexa" service in every media domain that advertises.

The answer to the compelling need and use cases is the amount of resources *mazon, *oogle, *BM, and other private concerns have invested into STT/TTS technology over the past 10 years, their combined projected investment into research over the next 10 years, and the profit those same concerns have made in the STT/TTS domain.

Mozilla should implement STT locally, for example using PocketSphinx or other FOSS code, and avoid using a remote service at all costs, even if it takes N more years to develop sufficient FOSS technology to implement SpeechRecognition.

@guest271314

Re https://bugzilla.mozilla.org/show_bug.cgi?id=1248897

Is the speech recognition code at the Mozilla end-point free open source software?

Why is the speech recognition code not shipped with the browser?

@guest271314

Relevant to text-to-speech (the W3C Web Speech API's SpeechSynthesisUtterance() and speechSynthesis.speak()): one long-standing issue can be fixed very simply by setting SSML parsing to "on" when establishing the auto-spawned connection to speech-dispatcher.

How to set SSML parsing to on at user configuration file?

Looking at the actual trackers on chromium, I can confirm that the bug is really there. Something like the attached patch would be just enough to fix it there for everybody, instead of working around it only for the few people who happen to know about it.

Proof-of-concept using Native Messaging, Native File System, and espeak-ng, which ChromiumOS authors decided to use with WASM instead of using the Web Speech API (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome):

// close <break/> tags
let input = `<speak version="1.0" xml:lang="en-US">
    Here are <say-as interpret-as="characters">SSML</say-as> samples.
    Try a date: <say-as interpret-as="date" format="dmy" detail="1">10-9-1960</say-as>
    This is a <break time="2500ms"/> 2.5 second pause.
    This is a <break/> sentence break.<break/>
    <voice name="Storm" rate="x-slow" pitch="0.25">espeak-<say-as interpret-as="characters">ng</say-as> using Native Messaging, Native File System </voice>
    and <voice name="English_(Caribbean)"> <sub alias="JavaScript">JS</sub></voice>
  </speak>`;
  
input = (new DOMParser()).parseFromString(input, "application/xml"); // XML Document
  
nativeMessagingEspeakNG(input)
.then(async({input, phonemes, result}) => {
  // do stuff with original input as text, or SSML string, phonemes of input, result: ArrayBuffer
})
.catch(console.error);

There is no technical reason why SSML parsing functionality for Web Speech API cannot be implemented right now.

@guest271314

Should not have had to write this code, native-messaging-espeak-ng, just to set the -m option for the local speech synthesis engine (which the Web Speech API communicates with via speech-dispatcher).

The user, at the default browser configuration (where speech-dispatcher is enabled), should be able to set SSML parsing to "on". Or provide an objective reason why SSML parsing is not turned on by default, or at least provide the option for the user to turn it on, or read that flag from the speech-dispatcher configuration file, e.g. ~/.config/speech-dispatcher/speechd.conf, which the browser should read and apply for all communication with the local speech synthesis server responsible for outputting user media.

For the use case of "Support SpeechSynthesis to a MediaStreamTrack" (WICG/speech-api#69), several options are available. The simplest, from the perspective here, would be to specify that getUserMedia() can select "Monitor of <device>" where such a device is available. That option was foreclosed by the Media Capture and Streams authors at their end; in summary, see w3c/mediacapture-main#654.

Users actually do expect this functionality to be present in the browser by default, from the perspective here, though not explicitly stated, given the state of the art to do so: rumkin/duotone-reader#3

I wish it be more widely adopted solution, which everyone can run in their browser without a need to install something in their system. But I understand it's not possible in the moment, so I'll be searching a way to make communicating with WhatWG.

Where, if I had relied on waiting for SSML parsing to be implemented by browsers that are calling the local binary that already has that option available, I would not have written SpeechSynthesisSSMLParser, where I found that web services and the speech synthesis engine itself could parse SSML differently, particularly with regard to <s> and <p> elements. Given that the Web Speech API has not taken ownership of SSML parsing and extensibility, which body in "web platform" land can dispute the implementation of SSML parsing by *BM, *mazon, *oogle?

@guest271314

Re whether the Web Speech API treats SpeechSynthesisUtterance input (.text) as text or SSML, see WICG/speech-api#10 (comment):

This looks to me like a chicken-and-egg issue.

If nothing in the spec says whether .text is ssml or not, I don't see how implementations would dare telling the synth that it's ssml when it's not even sure whether it is or not.

The backends are ready: espeak-ng supports it, speech-dispatcher supports it, it's a one-line change in speech-dispatcher-based browser backends to support it. But if the API does not tell when it should be enabled, we can not dare enabling it.

It is essentially impossible to proceed, without disregarding sanity, unless the specification is clear in this regard. The technology and infrastructure are already available and ready to go.

@guest271314

guest271314 commented May 23, 2020

SSML parsing is ready to go (be implemented without workarounds). Issues are already filed at Chromium and Firefox bug trackers (linked at #170 (comment)).

@dbaron dbaron added venue: W3C CG Specifications in W3C Community Groups (e.g., WICG, Privacy CG) and removed venue: W3C Specifications in W3C Working Groups labels Jun 5, 2020
@guest271314

One concrete item that can be fixed is censorship of words spoken by users

https://wicg.github.io/speech-api/#dom-speechrecognitionalternative-transcript

In pertinent part

transcript attribute, of type DOMString, readonly
The transcript string represents the raw words that the user spoke.

which Chrome/Chromium is willfully not conforming to, by deliberately censoring the "raw words" input by the user and/or the words printed in the transcript.

Issue 804812: Remove censorship from webkitSpeechRecognition https://bugs.chromium.org/p/chromium/issues/detail?id=804812

Instead of fixing the issue, Chromium authors aim to close the issue without fixing it, by just allowing it to sit un-fixed. Repugnant.

@marcoscaceres
Contributor

@guest271314, your comment about Chrome is out of line. Please refrain from expressing your frustration with another browser project here.

@guest271314

@marcoscaceres Explain exactly what you mean. Be specific. There is no "frustration". There is scientific observation.

Am completely over hypocrites attempting to get on a high horse and holler about some moral standard or selective "code of conduct" when I can point directly to evidence unequivocally demonstrating their own lack of being "in line", whatever that is. No matter the discipline or field of human activity, there is no escaping that I will find your errors and omissions and expose them to the universe to sort out.

@cwilso

cwilso commented Jun 24, 2020

To Marcos' point, you are tilting at the wrong windmill. This is Mozilla's standards-positions repo, where they comment on Mozilla's support for various standards proposals. Complaining about what issues Chrome has or hasn't fixed in their implementation of a proposal simply isn't relevant here. Continue advocating for your bug; tweet about it or whatever. Comment on the API spec itself, suggest that censorship should be explicitly forbidden, and see if there is support for that view. Haranguing Chrome on Mozilla's standards-position issue IS out of line, irrelevant, and useless in terms of potentially getting your issue fixed.

@guest271314

@marcoscaceres The Web Speech API makes no mention of censorship. Yet you appear to be supporting that proposition. Or, at least, deciding to ignore that implementation and instead focusing on the individual who is letting you know (whatever hat you are wearing today) that Chromium/Chrome is non-conformant with the specification.

Really have no clue what you mean by "out of line". You can ban all you want, but you will never have the rank to say what post is "out of line". The best you will ever be able to do is impose some absurd 1,000 year ban.

How about the novel idea of telling Chrome to stop selectively censoring content. Next they will censor history directly from the
primary sources. But you were warned about the censorship of arbitrary English words right here and now. I do not care which browser it is; they all supposedly fall under implementations of the Web Speech API. We know that Chrome sends their users' captured recordings to a remote server - their remote server - so they can change the code that censors the input without any delay.

Is your suggestion that users be filled with joy when their content is censored?

Mercy is for the weak, when I speak I scream -2Pac

@guest271314

@cwilso

Continue advocating for your bug; tweet about it or whatever. Comment on the API spec itself, suggest that censorship should be explicitly forbidden

Am banned from Twitter for no reason. They do not like facts, which is all that I post: primary sources. Am banned from contributing to the Web Speech API specification courtesy of @marcoscaceres and the fiasco of actually signing up for the organizations WICG and W3C, then being blocked and banned from both, from WICG for "1,000 years".

Was referred to this issue via a Media Capture and Streams issue. Had the sense you folks were trying to get the Web Speech API working in a compatible way. That is evidently not the case.

Now I just write code to achieve requirements, instead of trying to ask specifications to change things, as I found that WICG and W3C are both wont to be hypocritical and to selectively ban or otherwise censor content themselves by not allowing input.

Ok. You folks will sort it out, eventually. Perhaps. Cheers.

@guest271314

While you are here and focused on Mozilla, since the Web Speech API is not going to be discarded and begun anew, the least you can do is implement the flag for speech-dispatcher to enable SSML parsing, which has been technically possible since day 1 (WICG/speech-api#10 (comment), whatwg/webidl#880).

Have a great day!

@mhoye

mhoye commented Jun 24, 2020

That we aspire to be welcoming hosts does not entitle guests in our house to act as they please, and that's quite enough of that.

If you have something constructive you'd like to add to this discussion and find that you cannot, feel free to contact me directly to discuss it.
