Web Speech API #170
At a high level, the ability to access speech recognition and synthesis capabilities on the web is reasonable. Providing access to speech can do a lot to improve agency and accessibility on the web, and I don't see any significant problems with providing platform features toward that end. Many of these capabilities are already available through existing APIs, but the burden of providing the recognition/synthesis parts is borne by sites. Platform support would do a lot to improve access for all sites, not just those with the resources to build those systems.

However, I think that the form of the proposed API is not entirely consistent with our principles on how we might build those features. It seems to be driven by the Android speech APIs without much consideration for the APIs already present on the web. This could, and probably should, operate on media streams, both for input to recognition and for output of synthesis. For the text side, it would be good to understand how this might be mapped to WebVTT and its APIs, and to other APIs that operate on audio.

I would also like to gather the views of those who manage editor and keyboard input. I think that the recognition parts would benefit a lot from their views and experience. For instance, it would be good to understand how much (if at all) access to speech recognition might best be integrated with text input.

In thinking through the security of this, there is one potential trap. If this is built as it is defined, then speech recognition might be tuned to an individual. However, if a site can provide arbitrary audio input to the engine as I suggest, then any tuning might be exploited by a site to learn about individuals. Minimally, that is a fingerprinting exposure. But it might also go beyond that, to learn about the speech (including characteristics of voice and language, like gender) of the person using the computer, without requiring any permissions. That should be easy enough to defend against, but it's a consideration that could easily be overlooked.
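As a rough sketch of that WebVTT mapping idea, under the API shape as currently specified: recognition results carry no timing information, so the cue times below are approximated from the media clock (an assumption, not spec behavior), and the `video` element and two-second cue window are illustrative.

```js
// Sketch: surface recognition results as live WebVTT captions.
// SpeechRecognition is prefixed in Chrome; results have no timestamps,
// so cue boundaries here are rough approximations (an assumption).
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new Recognition();
recognition.continuous = true;
recognition.interimResults = false;

const video = document.querySelector("video");
const track = video.addTextTrack("captions", "Live transcript", "en");
track.mode = "showing";

recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (!result.isFinal) continue;
    const end = video.currentTime;
    // Two-second cue window ending "now": invented timing, see note above.
    track.addCue(new VTTCue(Math.max(0, end - 2), end, result[0].transcript));
  }
};

recognition.start(); // triggers a microphone prompt in current implementations
```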
Yes, maybe even on audio MediaStreamTracks directly (unless there are plans for lipreading?). Relying on implicit microphone capture inside the API is the problem. For example, in Firefox, the existing API ends up requesting permission from the end-user each time the speech API is called by default. This is undesirable, but the current API invites it, and leaves us and browsers like us with no way to fix this problem. The only workaround atm is to rely on web developers knowing that current implementations, AFAIK, call getUserMedia under the hood.

A quick band-aid API fix would perhaps be for `start()` to accept an audio track:

```js
const recognition = new SpeechRecognition();
const stream = await navigator.mediaDevices.getUserMedia({audio: true});
recognition.start(stream.getAudioTracks()[0]);
```

This quick change might be worth fighting for before considering releasing this API? On privacy, the risks to me seem to all stem primarily from microphone permission since, as mentioned, JS can already do speech recognition in software. I agree the risks are worth mentioning, however, maybe even in the spec.
By whom?
If an attacker had a speech corpus tagged with gender, a much easier attack would be to simply train a machine learning model to recognize gender, then use normal microphone access to send audio to their model.

The attack suggested is far more complicated. The attacker would have to pass their corpus to the STT engine, record the STT engine's responses, and create another machine learning model that learned the mistakes made by the STT engine that were correlated with gender. Then the attack would be to send speech to the STT engine, get the transcript back, send the transcript to their own model, and have that model guess gender. So using the STT engine to determine gender would add an unnecessary call to the STT engine and decrease the accuracy of the gender determination.

In other words, there are better ways of determining gender that are already present in the browser.
This is, with more details, the point I wanted to emphasize above.
If the browser runs the model, it makes sense for the model to be tuned for the person who uses the browser.
I am talking about the scenario where the site does not have access to the microphone. If microphone access is granted, then we have to assume that the site will learn anything exposed by that, but if an arbitrary stream can be fed into the recognition engine, then that can be used to probe the state of the engine. If that state includes information that we might prefer to consider private (for which I only used gender as an example), then that would not be good.
I don't mean to imply that this is easy, but when talking about attacks, we have to consider possibilities regardless of their complexity. A low-complexity attack would just use the state of the voice recognition engine for fingerprinting. Sites have already proven a willingness to go to extreme lengths to fingerprint people, and this doesn't seem that complicated.
A few points on this:
Probing the state of an STT engine in the way you suggest, using the mic, is possible without the WebSpeech API by using standard mic access and any number of existing STT services. My point is that we should be considering new lines of attack that are being opened as a result of the WebSpeech API.
Fair enough. But my point is that the attack you mention is already possible without the WebSpeech API, and is in fact easier to carry out without it.
I think that I wasn't clear enough. My point is that this information might be exposed without a permissions prompt. That is, if the model might be used on arbitrary audio, then asking for permission for a microphone is not necessary. I don't see any reason, or even a plausible excuse, for asking for permission to use this API. The risk profile for features that aren't permission-gated is very different.
OK. Got it! If the Web Speech API is implemented as per the spec (see Security and privacy considerations), then only an incorrect implementation would not prompt the user. For example, "Accepting a permission prompt shown as the result of a call to SpeechRecognition.start" would always prompt the user, regardless of the audio source.
That very much assumes the current design. I should have said that the questions regarding permission-free access only apply if you take the suggestion to better integrate with other existing APIs. If you build the API as specified (which would not be good for the platform as a whole, in my opinion), then of course you would want to have some sort of permission scheme with notice about recording.
So you suggest that, to better integrate with existing APIs, we should remove permission prompts? This cure seems worse than the disease. What in the existing standard is so problematic that it is worse than removing permission prompts?
The current API is tightly coupled with microphone permission for no clear reason, causing a permission prompt in Firefox every time the speech API is called. It's biased toward Chrome's permission model. Specifically (most of these gaps stem from the API hiding the audio path, as the sketch below illustrates):

- For users with more than one microphone, once they persist permission, there is no API for selecting which microphone to use.
- There's no way to recognize speech from audio sources other than the local mic, e.g. from a WebRTC call.
- There's no way to pre-process audio (e.g. with Web Audio) before feeding it to the speech API.
- There's no API for controlling noise suppression, gain control, or echo cancellation.
- There's no way to direct output to something other than local speakers, e.g. to send it over a WebRTC call.
- Interactions with other media APIs are left undefined.
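For concreteness, a sketch of how most of those gaps could close if `start()` accepted a track, per the band-aid proposal earlier in this thread. Everything below is a shipping API today except the `recognition.start(track)` overload, which is hypothetical, and `chosenDeviceId`, which is assumed to come from `enumerateDevices()` elsewhere.

```js
// Sketch under the hypothetical start(track) overload proposed above.
async function recognizeFromProcessedMic(chosenDeviceId) {
  // Select a specific microphone and control the UA's audio processing.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      deviceId: { exact: chosenDeviceId },
      noiseSuppression: false,
      echoCancellation: false,
      autoGainControl: false,
    },
  });

  // Pre-process with Web Audio, e.g. a high-pass filter to cut low rumble.
  const ctx = new AudioContext();
  const src = ctx.createMediaStreamSource(stream);
  const highpass = new BiquadFilterNode(ctx, { type: "highpass", frequency: 120 });
  const dest = ctx.createMediaStreamDestination();
  src.connect(highpass).connect(dest);

  // Hypothetical overload: recognize from any track, not just the local mic,
  // so the same path would also work for e.g. a remote WebRTC track.
  const recognition = new SpeechRecognition();
  recognition.start(dest.stream.getAudioTracks()[0]);
  return recognition;
}
```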
Sounds like there is a good opportunity for the WebRTC folks and Web Speech folks to work together to share expertise on getting better permissions integration and overall API cohesion. Conceptually, enabling "Web Speech" does seem like something we are pursuing at Mozilla/Firefox, so this might already be in the "worth prototyping" realm, but with the proviso that the API may need some refinements. Would that be a fair assessment? It definitely sounds like there are some real opportunities to collaborate.
@jan-ivar Most of your comments seem like, more-or-less, reasonable things we could change. We are not in production with a standard or non-standard version of the API, whereas Google, for example, is.

The alternative, which seems to be what you are suggesting, is simply to ignore the official API. I'd be interested in understanding how and when, or if, you want to get your suggestions into the spec.
I think what we are proposing is a set of changes to improve how the API works, and that could also help with privacy, security, and overall web-platform integration. Yes, that will likely come with some breaking changes at the cost of interop. And yes, we will need to convince our friends at Google and Apple to change their implementations, and with very good reason. But the spec is by no means "official"... it's a Community Group Draft, and thus subject to change if we want it to be standardized officially. Web standards are all about making these compromises.
Kindly please don't undersell Mozilla's position: we've always been constructive contributors to specifications, and our suggestions have always been welcomed and taken on board by the web community. We may not have 60% market share, but we do have a lot of users, and our opinions matter as much as anyone else's in the community (irrespective of market share).
I'm happy to say "worth prototyping" as long as the text makes it clear that we think the API is far from ideal in its current form.
I could note that the API is basically from 2011, well before getUserMedia or anything like that. And Chrome used to have something like `x-webkit-speech` before this.
AFAIK only Chrome implements SpeechRecognition. All but two of the concerns I raised seem solved by adding an optional audio-track argument to `start()`.

A more aggressive attempt to move the needle on other implementations might include lobbying for this argument to become non-optional. If we don't do that (meaning: if we don't bend web developers to use a different API surface where they obtain microphone permission separately, using getUserMedia), then the permission problems described above remain.

A counterpoint would be if having access only to speech results, and not the audio itself, constituted a lower privacy risk profile, but I don't think it does.
To my knowledge, Safari does not ship the SpeechRecognition part of this API. I can help summon the right people to comment on the WebKit and/or Apple stance on this. This thread is somewhat long, and I'm not totally sure what the key issues here are. I see the permissions model (where this could result in over-prompting) and integration with MediaStreams (not entirely unrelated). Anything else?
@othermaciej, other things:
Yes. You can, for example, have page-level recognition and then another one which is activated only when some particular element has focus. Though that may depend a bit on whether one uses specific grammars or relies on grammar-free recognition.

And indeed, the API could use modernization, but as I said, it is a very old API. The draft is mostly from 2012 (https://lists.w3.org/Archives/Public/public-speech-api/2012Apr/0001.html, https://w3c.github.io/speech-api/speechapi.html), so many of the recommendations for how to create APIs have changed since then.
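A minimal sketch of that pattern under the current API shape; the element ID is made up, Chrome needs the `webkit` prefix, and whether two recognizers may be active at once is left to the implementation.

```js
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;

// Page-level recognizer: runs continuously for page-wide voice commands.
const pageRecognizer = new Recognition();
pageRecognizer.continuous = true;
pageRecognizer.onresult = (e) => {
  const last = e.results[e.results.length - 1];
  console.log("page-level:", last[0].transcript);
};
pageRecognizer.start();

// Field-scoped recognizer: active only while a particular element has focus.
const field = document.querySelector("#search"); // hypothetical input element
const fieldRecognizer = new Recognition();
fieldRecognizer.onresult = (e) => {
  field.value = e.results[0][0].transcript;
};
field.addEventListener("focus", () => fieldRecognizer.start());
field.addEventListener("blur", () => fieldRecognizer.stop());
```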
Sure, but the API itself has evidently been actively updated throughout much of 2018. Taking the first cut of a 2012 API and expecting it not to evolve is inexcusable, especially given that it didn't undergo formal standardization and has evidently had limited privacy review. We are constantly making APIs better on the platform. So it might be that, with a few tweaks, backwards compatibility can be retained while also improving the overall shape and ergonomics of the API.

The spec also lacks most implementation details: it only describes intended behavior, not how to actually implement that behavior. There are no algorithms, it only loosely defines event ordering, and so on.

Developer ergonomics and lack of implementation details aside, I'm still wondering what the speech recognition API is trying to achieve. The use case of filling form fields seems somewhat unnecessary, given that OSs already provide this capability (at least, iOS's keyboard lets me dictate into text fields... and I can do the same in macOS). And I imagine everyone involved with the Speech API has seen the upcoming Voice Control on macOS: please watch https://youtu.be/v72nu602WXU if you haven't. Is there a risk that this API could conflict with technologies like Voice Control? That could be quite harmful to accessibility, even if the web application is well intentioned.

From a privacy perspective, the SpeechSynthesis API is a fingerprinting bonanza: the API has the ability to detect when new voices are installed on a user's system (via getVoices() and the voiceschanged event).
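To make the fingerprinting point concrete, here is a sketch of the surface in question. `getVoices()`, the `voiceschanged` event, and the voice fields used are all in the current spec, and none of this sits behind a permission.

```js
// Enumerate installed voices; the list is device- and install-specific,
// which is exactly the fingerprinting entropy described above.
function voiceFingerprint() {
  return speechSynthesis
    .getVoices()
    .map((v) => [v.name, v.lang, v.localService, v.default].join("|"))
    .join(";");
}

// Fires when the voice list changes, e.g. when a new system voice is
// installed, letting a page observe installs as they happen.
speechSynthesis.addEventListener("voiceschanged", () => {
  console.log("voices changed:", voiceFingerprint());
});
```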
One of the key use cases for the Web Speech API is multimodal applications; https://www.w3.org/2005/Incubator/htmlspeech/XGR-htmlspeech-20111206/#use-cases has more.

I'm having a hard time seeing how SpeechRecognition is harmful. Yes, the permission handling may need tweaking and the API could use modernization, but is that enough to say it is harmful? There have been multiple proposals over the years for handling speech on the web.
Given the shape and state of implementation of the current Chrome-only API, it also makes sense to me to push for a more modern version. SFSpeechRecognizer, for instance, supports both microphone content and recorded content. The use of SFSpeechRecognizer in native apps is also guarded by a one-time user prompt, in both the microphone-content and recorded-content cases.
Not to rock the boat too much, but it's worth noting that this conversation is unfortunately rather late to the game, given that websites are already using this feature in production apps (a few of them can be seen on the relevant Bugzilla bug as see-also links to webcompat.com). It may be a long road to find all of them and convince them to care if we cannot get buy-in from Chromium to deprecate and remove the feature in a timely manner.
@wisniewskit, so yes, sites are using it, but usage is extremely low, and interoperability is even lower. We are already collaborating with Google on removing a bunch of dead/silly stuff from the spec... we may be able to refactor it into something useful/usable 🤞.
Usage is low because (among other reasons), even though Chrome has had this API enabled since circa 2011 (iirc), the web was strategically never Google's top priority for spreading the adoption of such technology; Android was, so they could lock developers and users into their ecosystem. The same applies to other browser vendors. Even these days, as they start to support voice on desktops (Pixelbooks), they ship Google Assistant, like Siri on macOS, for example.

Having said that, and having been in this space for so long, I don't see any other company able (and interested) to advance the web by bringing speech to empower developers besides Mozilla, since the others tend to promote their own proprietary platforms. If we don't do it, no one will, and the web will continue being left behind (although it seems Edge started to support it in 76, https://caniuse.com/#search=speech, probably due to their switch to Blink).
For the record, I of course recognize that Mozilla should try to usher in a better API, and that usage of the existing one is low. But frankly speaking, neither of those things matters unless:
Over the past year I have only seen the number of reported webcompat.com issues involving speech APIs increasing, not decreasing. I cannot see that number ever slowing down now that Edge also supports the API (especially given that both Google and Microsoft have already been using it on their sites and services for quite some time now). As such, I feel we need to prioritize at least getting strong public buy-in from major existing users of such APIs relatively soon -- presumably the likes of Google, Microsoft and Duolingo -- or we'll just end up losing this battle or having to support two APIs. Ideally, we'll find that there is already a desire to "do better". But I do feel we have to err on the side of caution here, given how the likes of CSS scrollbar APIs and touch/pointer events worked out.
@wisniewskit, I think I've miscommunicated the issues with the API, and for that I'm sorry. The problem is not that the whole API is broken and we should do away with it: on the contrary, people here are arguing that we should fix what we can, because in its current state it's harmful.

I'm not sure how versed you are at reading specs, but if you take a look at the actual spec, you will see that there are parts of the API that are either impossible to implement in an interoperable manner, or where the spec doesn't say what to do. To be blunt, the spec hardly qualifies as a spec at all... it's more of a wish list thinly disguised as a technical specification only because it uses a W3C stylesheet: there are no algorithms, there is basically zero specified error handling, the eventing model is a total mystery, and much of it is just hand-waving that magical things will happen and speech will somehow be generated/recognized (see the grammars section of the spec for a good hearty chuckle).

If you take a look at the spec repository, you will see that we've been actively working together with folks from Google to remove some of the things that are in the spec but that Google didn't actually implement. Please also note that Edge shipping it doesn't mean anything at this point. Edge is just taking whatever is in Blink, so the changes we are making to the spec will just automatically appear there.

With that said, yes, Mozilla is totally the only player in this space that can open this up and make things happen (as @andrenatal rightfully points out). However, to do that, we need a spec that's actually implementable, can be made interoperable, has a proper permissions model, and doesn't add a ton of fingerprinting entropy. So while there will be "compat bugs", without a proper spec we will be forced to go read Chrome's source code to see what we are supposed to be implementing. The spec, in its current state, ain't gonna help us.

Hopefully that clarifies things.
A few of us are working to address the concerns identified in this thread above. As Emerging Technologies would like to ship this (see intent to ship), we should mark it as "important". Steps forward here are to update the spec to address the things we discussed above.
Based on discussions, and given that it's only shipping in Nightly, the suggested position above should actually be "worth prototyping".
Relevant to "security" concerns and Chromium's implementation of Re TTS, technically it is already possible, with sufficient effort, to capture audio output of The technology to carry out the existing Web Speech API could certainly be updated to expose the direct socket connect to A start on SSML parsing WICG/speech-api#10 for TTS https://github.com/guest271314/SpeechSynthesisSSMLParser. Note, due to lack of development and deployment consensus re SSML on the web platform, the last time that checked *mazaon "Alexa"; and "Polly"; *BM "Watson Bluemix"; *oogle "Actions"; etc. each parsed What is remarkable is when the topic of TTS/SST is raised by users who have been attempting to implement functionality or create workarounds is an invariable question about uses cases. All that is required to answer the question of use cases is the amount of resources private concerns have invested into the technology over the past 10 years. Some very basic cases https://lists.w3.org/Archives/Public/public-speech-api/2017Jul/0004.html. Other use cases involve creating an audio book; an individual had a tooth pulled though wants to send a voice message; developing FOSS SST/TTS locally (espeak-ng/espeak-ng#669). Consider this well when asking about use cases and compelling interest: There are ads for *mazaon's "Alexa" service on in every media domain that advertises. The answer to the compelling need and use cases is the amount of resources *mazon, *oogle, *BM, and other private concerns have invested into SST/TTS technology over the past 10 years, their combined projected investment into research over the next 10 years, and the profit those same concerns have made in the SST/TTS domain. Mozilla should implement SST locally, for example, using Pocket Sphinx, or other FOSS code, and avoid using some remote service at all cost, even if it takes |
Re https://bugzilla.mozilla.org/show_bug.cgi?id=1248897: is the speech recognition code at the Mozilla endpoint free open-source software? Why is the speech recognition code not shipped with the browser?
Relevant to Text to Speech (W3C Web Speech API): a proof-of-concept using Native Messaging and the Native File System. There is no technical reason why SSML parsing functionality for the Web Speech API cannot be implemented right now.
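For what that would mean in author terms: the current spec already says `SpeechSynthesisUtterance.text` may be either plain text or a well-formed SSML document, so a conforming engine could accept something like the sketch below. Engines today generally strip the markup or read it aloud, which is what the parser linked above works around.

```js
// Sketch: SSML handed directly to the utterance, as the spec's text
// attribute permits. The break element is standard SSML 1.1.
const ssml = `<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Hello <break time="500ms"/> world.
</speak>`;

const utterance = new SpeechSynthesisUtterance(ssml);
utterance.lang = "en-US";
speechSynthesis.speak(utterance);
```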
Should not have had to write this code (native-messaging-espeak-ng) just to set a single synthesis option that the user at the default browser configuration cannot reach.

For the use case of "Support SpeechSynthesis to a MediaStreamTrack" (WICG/speech-api#69), several options are available. The simplest, from the perspective here, would be to specify that synthesis output can be captured as a MediaStreamTrack directly. Users actually do expect this functionality to be present in the browser by default, from the perspective here, though not explicitly stated, given the state of the art to do so: rumkin/duotone-reader#3
Had relied on waiting for SSML parsing to be implemented by the browsers that are calling a local binary which already has that option available, would not have written SpeechSynthesisSSMLParser.
Re text versus SSML input to speech synthesis in the Web Speech API: it is essentially impossible to proceed sanely without the specification being clear in this regard. The technology and infrastructure are already available and ready to go.
SSML parsing is ready to go (it can be implemented without workarounds). Issues are already filed on the Chromium and Firefox bug trackers (linked at #170 (comment)).
One concrete item that can be fixed is censorship of words spoken by users: https://wicg.github.io/speech-api/#dom-speechrecognitionalternative-transcript. In pertinent part, the transcript is defined as the raw words that the user spoke,
which Chrome and Chromium willfully fail to conform to by deliberately censoring the "raw words" input by the user and/or the words printed in the result. See Issue 804812: Remove censorship from webkitSpeechRecognition, https://bugs.chromium.org/p/chromium/issues/detail?id=804812. Instead of fixing the issue, the Chromium authors aim to close it without fixing it, just letting it sit unfixed. Repugnant.
@guest271314, your comment about Chrome is out of line. Please refrain from expressing your frustration with another browser project here.
@marcoscaceres Explain exactly what you mean. Be specific. There is no "frustration". There is scientific observation. Am completely over hypocrites attempting to get on a high horse and holler about some moral standard or selective "code of conduct" when can point directly to evidence unequivocally demonstrating their own lack of being "in line", whatever that is. No matter the discipline or field of human activity, there is no escaping that I will find your errors and omissions and expose them to the universe to sort out.
To Marcos' point, you are tilting at the wrong windmill. This is Mozilla's standards-positions repo, where they comment on Mozilla's support for various standards proposals. Complaining about what issues Chrome has or hasn't fixed in their implementation of a proposal simply isn't relevant here. Continue advocating for your bug; tweet about it or whatever. Comment on the API spec itself, suggest that censorship should be explicitly forbidden, and see if there is support for that view. Haranguing Chrome on Mozilla's standards-positions issue IS out of line, irrelevant, and useless in terms of potentially getting your issue fixed.
@marcoscaceres The Web Speech API makes no mention of censorship. Yet you appear to be supporting that proposition. Or, at least, deciding to ignore that implementation and instead focusing on the individual who is letting you know (whatever hat you are wearing today) that Chromium and Chrome are non-conformant with the specification. Really have no clue what you mean by "out of line". You can ban all you want, but you will never have the rank to say what post is "out of line". The best you will ever be able to do is impose some absurd 1,000-year ban. How about the novel idea of telling Chrome to stop selectively censoring content? Next they will censor history directly from the browser. Is your suggestion that users be filled with joy when their content is censored?
Am banned from Twitter for no reason. They do not like facts, which is all that post was: primary sources. Am banned from contributing to the Web Speech API specification courtesy of @marcoscaceres and the fiasco of actually signing up for the organizations WICG and W3C, then being blocked and banned from both, from WICG for "1,000 years". Was referred to this issue via a Media Capture and Streams issue. Had the sense you folks were trying to get the Web Speech API working in a compatible way. That is evidently not the case. Now, just write code to achieve requirements instead of trying to ask specifications to change things, as found that WICG and W3C are both wont to be hypocritical and to selectively ban or otherwise censor content themselves by not allowing input. Ok. You folks will sort it out, eventually. Perhaps. Cheers.
While you are here and focused on Mozilla: since the Web Speech API is not going to be discarded and begun anew, the least you can do is implement the flag for SSML parsing. Have a great day!
That we aspire to be welcoming hosts does not entitle guests in our house to act as they please, and that's quite enough of that. If you have something constructive you'd like to add to this discussion and find that you cannot, feel free to contact me directly to discuss it.
Request for Mozilla Position on an Emerging Web Specification