Support SpeechSynthesis *to* a MediaStreamTrack #69
Comments
This is possible to an appreciable degree using the approach in https://github.com/guest271314/SpeechSynthesisRecorder. In order to make the change concrete and direct adjustments to the The code
At the command line we can currently do
where |
Technically using |
|
Kindly compose the specification to do just that so that these workarounds can be retired. |
@cwilso BTW the maintainers of |
I think we're talking at cross purposes. If you think Web Speech should be built differently, dive into that discussion - I think you're fundamentally saying "Web Speech shouldn't exist, we can already do this with speechd" - but is that really true across different OSes and systems? I think it would be good to have one, relatively simple API to do TTS. I am additionally suggesting here in this issue that you should be able to get a Media Stream of that output (rather than have it piped to audio output). "Plug in some headphones to avoid audio output" is not a usable expectation for users (just like "install a loopback driver" is not a realistic expectation either, for a bunch of scenarios people have asked for in Web Audio). I'm not entirely sure what your implication is here, because I'm saying precisely this - we need the ability to pipe the stream of audio data from a speech utterance to a Media Stream. Doing that through a Web Socket connection set up to speechd with required client code running in the UI thread creating buffersource nodes and decoding and start(0)'ing audio files as they come in seems like a roundabout way of doing this. |
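For readers unfamiliar with the roundabout path described in the comment above, here is a minimal sketch of its client side, under stated assumptions. `ChunkPlayer`, `AudioContextLike`, `SourceLike` and `BufferLike` are names invented for this example. They only mirror the parts of the Web Audio API the sketch touches (`decodeAudioData`, `createBufferSource`, `connect`, `start`, `currentTime`), and the transport that delivers the chunks (for example a WebSocket to a local speech service) is left out entirely.

```typescript
// Structural stand-ins for the Web Audio types, so the sketch also runs
// outside a browser. They are assumptions of this example.
interface BufferLike {
  readonly duration: number; // seconds
}
interface SourceLike {
  buffer: BufferLike | null;
  connect(destination: unknown): unknown;
  start(when?: number): void;
}
interface AudioContextLike {
  readonly currentTime: number;
  decodeAudioData(data: ArrayBuffer): Promise<BufferLike>;
  createBufferSource(): SourceLike;
}

class ChunkPlayer {
  private ctx: AudioContextLike;
  private destination: unknown;
  private nextStart = 0;

  constructor(ctx: AudioContextLike, destination: unknown) {
    this.ctx = ctx;
    this.destination = destination;
  }

  // Schedules one decoded chunk right after the previous one and returns
  // the start time that was used.
  playDecoded(buffer: BufferLike): number {
    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.destination);
    const when = Math.max(this.ctx.currentTime, this.nextStart);
    source.start(when);
    this.nextStart = when + buffer.duration;
    return when;
  }

  // Decodes an encoded chunk (for example one WebSocket message), then
  // schedules it. Callers should await each call to keep chunks in order.
  async playEncoded(chunk: ArrayBuffer): Promise<number> {
    return this.playDecoded(await this.ctx.decodeAudioData(chunk));
  }
}
```

In a browser, `destination` would presumably be the node returned by `audioContext.createMediaStreamDestination()`, whose `stream.getAudioTracks()[0]` is the resulting track. Every chunk still costs a decode and a new source node on the page side, which is the overhead the comment calls roundabout.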
At *nix neither Chrome, Chromium nor Mozilla Firefox, Nightly implementations write their own "speech synthesis engines", no speech synthesis engine is included in the source code (Windows and Mac may be different here). Kindly read the above-linked SO question carefully while cross-referencing the source code of the respective browsers. The implementations of Web Speech API at the former browsers rely entirely on there being "speech synthesis engines" already installed on the local machine which That means that when there is no speech synthesis engine installed locally Web Speech API alone does not perform any speech synthesis. AFAICT the specification does not currently mandate that either speech synthesis or speech recognition MUST be performed locally. Web Speech API is not a speech engine itself. The same is true for speech recognition, perhaps save for Android and iOS handheld devices. To change which "speech synthesis engines" are used you can execute
Agree, that should be an option, or as you appear to suggest, default.
That was merely stated to verify that what is being recorded is not the microphone, but rather, audio output. If you open sound settings at *nix while Yes, a roundabout way, though very possible. Native Messaging can also be utilized, to avoid having to interface with Web Speech API at all, as very few significant changes have been made since the specification was published. A A Native Messaging requires loading the code at If you are trying to actually fix the specification, write the words that will do just that. If you are trying to achieve the requirement in spite of the current specification, options are available, e.g. at the front-end you can use If you are trying to do both you can achieve the expected result while composing the PR. |
You do not have to create a buffer source. You can |
I'll just note it'd be more fitting to have a MediaStreamTrack be the output rather than a MediaStream, unless there are requirements that the stream must be exposed early on, and output (tracks) must come and go throughout its lifetime. It seems to me that |
@Pehrsons you're right, it would probably be more fitting to use MediaStreamTrack. For the use cases I listed, I think it would make more sense to have a more long-lasting MediaStreamTrack than a single Utterance, and also it's critically important to NOT send that output to the main audio output as well. (E.g. this should be maybe an optional parameter to speak(), or a mode you set up via SpeechSynthesis.getMediaStreamTrack() (and release somehow when you're done)). Creating and destroying MediaStreamTracks for every utterance would seem to be costly and prone to causing audio artifacting. |
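To make the shape floated in the comment above concrete, here is a rough model of it. Everything in it is hypothetical: `getMediaStreamTrack()` is only the name suggested in that comment and is not part of the Web Speech API, while `releaseMediaStreamTrack()`, `TrackLike` and `MockSynthesis` are invented for this sketch. The mock produces no audio. It only models the lifecycle being discussed: one long-lived track reused across utterances, nothing sent to the default audio output while the track is held, and an explicit release that ends the track.

```typescript
// Hypothetical sketch only: none of these names exist in the Web Speech API.
type TrackState = "live" | "ended";

interface TrackLike {
  readonly kind: "audio";
  readyState: TrackState;
  // Utterance texts routed into this track (a stand-in for audio data).
  readonly routed: string[];
}

class MockSynthesis {
  private track: TrackLike | null = null;
  // Utterances that went to the default audio output instead of a track.
  readonly playedOnSpeakers: string[] = [];

  // Name taken from the comment above; the semantics are an assumption.
  // Returns the same long-lived track until it is released.
  getMediaStreamTrack(): TrackLike {
    let track = this.track;
    if (track === null) {
      track = { kind: "audio", readyState: "live", routed: [] };
      this.track = track;
    }
    return track;
  }

  // Hypothetical "release somehow when you're done" step: ends the track
  // and restores normal playback for later utterances.
  releaseMediaStreamTrack(): void {
    if (this.track !== null) {
      this.track.readyState = "ended";
      this.track = null;
    }
  }

  // While a track is held, output goes only to the track, never to both.
  speak(text: string): void {
    if (this.track !== null) {
      this.track.routed.push(text);
    } else {
      this.playedOnSpeakers.push(text);
    }
  }
}
```

An optional parameter to `speak()`, the other option mentioned in the comment, would move the routing decision to each call instead of a mode held on the synthesis object.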
(Changed title to reflect @Pehrsons' suggestion.) |
I think a long-lived MediaStreamTrack is fine as long as there is something in SpeechSynthesis making it end eventually (i.e., so garbage collection of SpeechSynthesis cannot be observed through the track's ended event). That said, as an implementer of mediacapture APIs in Firefox I don't think having multiple tracks is prone to cause audio artifacting. If it were, that would be a bad implementation of playback of MediaStreamTracks. Whether creating and destroying tracks for every utterance is costly depends on perspective, I guess. How long would an utterance be? Are we talking one track per ten seconds or hundreds per second? I assume the former. Garbage collecting lots of objects can be noticeable, but "lots" might have to be fairly high for that, even for a mobile device. Note: this is anecdotal, I don't have data to back it up. Let's also not forget what performance impact a muted MediaStreamTrack might have. In Firefox it means we keep an audio stream open towards the OS because the track can become unmuted at any time (in other MediaStreamTrack APIs muted tends to be short-lived, i.e., it will be unmuted as soon as the connection is set up, the decoder has finished seeking, etc.). If there's a reference to (an idle) SpeechSynthesis object keeping the muted track alive, that might cause quite the power drain. |
Until a speech synthesis engine is shipped with the browser and provides a means to get a At Chrome or Chromium open DevTools, select
the local code can be PHP, Python, bash, or other preferred programming language. In general, the same code can be run using
|
@cwilso This should capture only audio output, precisely when
Firefox throws an
|
It would be very helpful to be able to get a stream of the output of SpeechSynthesis.
For explicit use cases, I would like to:
(This is similar/inverse/matching/related feature to #66.)