
Support SpeechSynthesis *to* a MediaStreamTrack #69

Open
cwilso opened this issue Oct 7, 2019 · 14 comments

Comments

@cwilso
Member

cwilso commented Oct 7, 2019

It would be very helpful to be able to get a stream of the output of SpeechSynthesis.

As explicit use cases, I would like to:

  • position speech synthesis in a virtual world in WebXR (using Web Audio's PannerNode)
  • be able to feed speech synthesis output through a WebRTC connection
  • have speech synthesis output be able to be processed through Web Audio

(This is a similar/inverse/related feature to #66.)
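For illustration, here is a sketch of how the first two use cases might look if such an output existed. Note that `getMediaStreamTrack()` is a hypothetical method invented here for demonstration; it is not part of any specification or implementation.

```javascript
// Sketch only: SpeechSynthesis.getMediaStreamTrack() is a hypothetical API
// invented for illustration; it does not exist in the Web Speech API.
function spatializeSpeech(speechSynthesis, audioContext) {
  // Hypothetical: obtain a MediaStreamTrack carrying the synthesis output
  const track = speechSynthesis.getMediaStreamTrack();
  const stream = new MediaStream([track]);
  // Use case 1: position the voice in a virtual world via a PannerNode
  const source = audioContext.createMediaStreamSource(stream);
  const panner = new PannerNode(audioContext, { positionX: 3, positionZ: -2 });
  source.connect(panner).connect(audioContext.destination);
  return { track, panner };
}

function sendSpeechOverWebRTC(speechSynthesis, peerConnection) {
  // Use case 2: feed the synthesis output through a WebRTC connection
  const track = speechSynthesis.getMediaStreamTrack();
  peerConnection.addTrack(track, new MediaStream([track]));
}
```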

@guest271314

This is possible to an appreciable degree using the approach in https://github.com/guest271314/SpeechSynthesisRecorder. To make the change concrete, direct adjustments to the speechd socket connection (https://github.com/brailcom/speechd and, to the degree necessary, spd-conf in python3-speechd) can be made; see https://stackoverflow.com/questions/48219981/how-to-programmatically-send-a-unix-socket-command-to-a-system-server-autospawne.

The code

async function SSMLStream({ssml = "", options = ""}) {
  // POST the SSML markup and espeak options to the server-side script
  const fd = new FormData();
  fd.append("ssml", ssml);
  fd.append("options", options);

  const request = await fetch("speak.php", {method: "POST", body: fd});
  // the response body is the WAV output of espeak as an ArrayBuffer
  const response = await request.arrayBuffer();
  return response;
}

let ssml = `<speak version="1.0" xml:lang="en-US"> 
             Here are <say-as interpret-as="characters">SSML</say-as> samples. 
             Hello universe, how are you today? 
             Try a date: <say-as interpret-as="date" format="dmy" detail="1">10-9-1960</say-as> 
             This is a <break time="2500ms" /> 2.5 second pause. 
             This is a <break /> sentence break. <break />
             <voice name="us-en+f3" rate="x-slow" pitch="0.25">espeak using</voice> 
             PHP and <voice name="en-us+f2"> <sub alias="JavaScript">JS</sub></voice>
           </speak>`;

SSMLStream({ssml, options: "-v en-us+f1"})
.then(async (data) => {
  const context = new AudioContext();
  const source = context.createBufferSource();
  source.buffer = await context.decodeAudioData(data);
  source.connect(context.destination);
  source.start();
});
// PHP
<?php 
  if (isset($_POST["ssml"])) {
    header("Content-Type: audio/x-wav");
    $options = $_POST["options"];
    // escapeshellarg() guards the SSML against shell injection;
    // $options is still passed through unescaped in this demonstration
    echo shell_exec("espeak -m --stdout " . $options . " " . escapeshellarg($_POST["ssml"]));
  }

At the command line we can currently do

espeak-ng -m --stdout > output && ~/dataurl output

where dataurl is a bash script which converts a file to a data URL

@guest271314

Technically, using navigator.mediaDevices.enumerateDevices() and a selected "audiooutput" device should achieve the requirement WebAudio/web-audio-api#1764 (comment).

@cwilso
Member Author

cwilso commented Oct 7, 2019

  1. I don't dispute that you could use other speech synthesis engines via web sockets or the like; this is explicitly about the Web Speech API interfaces.

  2. You can't use "audiooutput": it's an output device, not an input device, so this doesn't work in any implementation I know of.

  3. That approach would include ALL the sounds currently being played through the output - which would explicitly defeat the scenarios I suggested.

@guest271314

guest271314 commented Oct 7, 2019

@cwilso

  1. Web Speech API in fact uses the binary installed on the local machine. Meaning the API is calling espeak or espeak-ng via speechd anyway.

  2. Have you followed the instructions at web audio api connected to speech api WebAudio/web-audio-api#1764 (comment)? When you plug in the headphones there is no audio output to "speakers". When you check the system sound settings you will see that output is managed by the socket connection.

  3. Agree. The linked code is a workaround. To make the change in the browsers' source code (both Mozilla and Chrome/Chromium utilize speechd, i.e. speech-dispatcher), change the parameters set at the socket connection. And there SHOULD be a means to select the output of the speech engine, instead of any and all audio that is potentially being output by the browser. That appears to be what the related issue is describing: for example, setting the kind of the MediaStreamTrack to "speech" for speech synthesis and speech recognition, for disambiguation.

Kindly compose the specification to do just that so that these workarounds can be retired.

@guest271314

@cwilso BTW the maintainers of speechd are very astute and helpful. Given your pedigree am relatively certain they would assist suggesting the necessary changes that need to be made at the source code. The specification part is straightforward: provide the option to pipe audio output to a MediaStream.

@cwilso
Member Author

cwilso commented Oct 7, 2019

I think we're talking at cross purposes. If you think Web Speech should be built differently, dive into that discussion - I think you're fundamentally saying "Web Speech shouldn't exist, we can already do this with speechd" - but is that really true across different OSes and systems? I think it would be good to have one relatively simple API to do TTS. I am additionally suggesting here in this issue that you should be able to get a MediaStream of that output (rather than have it piped to audio output).

"Plug in some headphones to avoid audio output" is not a usable expectation for users (just like "install a loopback driver" is not a realistic expectation either, for a bunch of scenarios people have asked for in Web Audio). I'm not entirely sure what your implication is here, because I'm saying precisely this - we need the ability to pipe the stream of audio data from a speech utterance to a Media Stream. Doing that through a Web Socket connection set up to speechd with required client code running in the UI thread creating buffersource nodes and decoding and start(0)'ing audio files as they come in seems like a roundabout way of doing this.

@guest271314

guest271314 commented Oct 7, 2019

At *nix, neither the Chrome, Chromium, Mozilla Firefox nor Nightly implementations write their own "speech synthesis engines"; no speech synthesis engine is included in the source code (Windows and Mac may be different here). Kindly read the above-linked SO question carefully while cross-referencing the source code of the respective browsers.

The implementations of Web Speech API at the former browsers rely entirely on there being "speech synthesis engines" already installed on the local machine which speech-dispatcher executes.

That means that when there is no speech synthesis engine installed locally, Web Speech API alone does not perform any speech synthesis. AFAICT the specification does not currently mandate that either speech synthesis or speech recognition MUST be performed locally. Web Speech API is not a speech engine itself.

The same is true for speech recognition, perhaps save for Android and iOS handheld devices.

To change which "speech synthesis engines" are used you can execute spd-conf to select Mary, Flite, espeak (usually shipped by default at *nix distributions) etc., whatever speech synthesis engines are installed locally.

I am additionally suggesting here in this issue that you should be able to get a MediaStream of that output (rather than have it piped to audio output).

Agree, that should be an option, or as you appear to suggest, default.

"Plug in some headphones to avoid audio output"

That was merely stated to verify that what is being recorded is not the microphone, but rather, audio output. If you open sound settings at *nix while speak() is being called you can observe that. If you close Chromium you might even observe that the socket is still open!

Yes, a roundabout way, though very possible. Native Messaging can also be utilized, to avoid having to interface with Web Speech API at all, as very few significant changes have been made since the specification was published. A WebSocket allows direct communication at any origin, e.g., at the console and/or as a Snippet that can be run at any page.

A WebSocket approach which can be used to pipe output from calling the locally installed speech synthesis binary to a MediaStreamTrack https://medium.com/@martin.sikora/node-js-websocket-simple-chat-tutorial-2def3a841b61

Native Messaging requires loading the code at chrome: protocol.

If you are trying to actually fix the specification, write the words that will do just that.

If you are trying to achieve the requirement in spite of the current specification, options are available, e.g. at the front-end you can use meSpeak.js https://stackoverflow.com/questions/38727696/generate-audio-file-with-w3c-web-speech-api.

If you are trying to do both you can achieve the expected result while composing the PR.

@guest271314

You do not have to create a buffer source. You can connect the live captured MediaStreamTrack to a media stream destination and/or AudioWorkletNode. Again, that code was for demonstration purposes only.
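A minimal sketch of that alternative, assuming a live audio MediaStreamTrack (here called `track`) obtained elsewhere; these are browser-only APIs and the function is for demonstration only:

```javascript
// Sketch: route an existing live MediaStreamTrack through Web Audio without
// decoding into an AudioBuffer first. Browser-only; `track` is assumed to be
// a live audio MediaStreamTrack obtained elsewhere.
function routeLiveTrack(track) {
  const ac = new AudioContext();
  const source = ac.createMediaStreamSource(new MediaStream([track]));
  // re-capture as a new MediaStreamTrack via a media stream destination
  const destination = ac.createMediaStreamDestination();
  source.connect(destination);
  // optionally also monitor through the speakers
  source.connect(ac.destination);
  return destination.stream.getAudioTracks()[0];
}
```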

@Pehrsons

Pehrsons commented Oct 8, 2019

I'll just note it'd be more fitting to have a MediaStreamTrack be the output rather than a MediaStream, unless there are requirements that the stream must be exposed early on, and output (tracks) must come and go throughout its lifetime.

It seems to me that speak(utterance) could return a MediaStreamTrack. Or some variant on that to maintain backwards compatibility.

@cwilso
Member Author

cwilso commented Oct 9, 2019

@Pehrsons you're right, it would probably be more fitting to use MediaStreamTrack.

For the use cases I listed, I think it would make more sense to have a longer-lasting MediaStreamTrack than a single Utterance, and it's also critically important to NOT send that output to the main audio output as well. (E.g., this could be an optional parameter to speak(), or a mode you set up via SpeechSynthesis.getMediaStreamTrack() (and release somehow when you're done).) Creating and destroying MediaStreamTracks for every utterance would seem to be costly and prone to causing audio artifacting.
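A usage sketch of the acquire/release mode described above; `getMediaStreamTrack()` and the release pattern are hypothetical names for illustration only, not part of any specification:

```javascript
// Hypothetical usage sketch for the acquire/release mode described above.
// SpeechSynthesis.getMediaStreamTrack() is invented for illustration.
function speakToTrack(synthesis, text) {
  // Hypothetical: while a track is held, output is NOT sent to the speakers
  const track = synthesis.getMediaStreamTrack();
  const utterance = new SpeechSynthesisUtterance(text);
  synthesis.speak(utterance);
  utterance.onend = () => {
    // hypothetical release: stop the long-lived track when done with it
    track.stop();
  };
  return track;
}
```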

@cwilso cwilso changed the title Support SpeechSynthesis *to* a MediaStream Support SpeechSynthesis *to* a MediaStreamTrack Oct 9, 2019
@cwilso
Member Author

cwilso commented Oct 9, 2019

(Changed title to reflect @Pehrsons' suggestion.)

@Pehrsons

I think a long-lived MediaStreamTrack is fine as long as there is something in SpeechSynthesis making it end eventually (i.e., so garbage collection of SpeechSynthesis cannot be observed through the track's ended event).

That said,

As an implementer of mediacapture APIs in Firefox I don't think having multiple tracks is prone to cause audio artifacting. If there was, that would be a bad implementation of playback of MediaStreamTracks.

Whether creating and destroying tracks for every utterance is costly depends on perspective, I guess. How long would an utterance be? Are we talking one track per ten seconds or hundreds per second? I assume the former. Garbage collecting lots of objects can be noticeable, but "lots" might have to be fairly high for that, even for a mobile device. Note: this is anecdotal, I don't have data to back it up.

Let's also not forget what performance impact a muted MediaStreamTrack might have. In Firefox it means we keep an audio stream open towards the OS because the track can become unmuted at any time (in other MediaStreamTrack APIs muted tends to be short-lived, i.e., it will be unmuted as soon as the connection is set up, the decoder has finished seeking, etc.). If there's a reference to (an idle) SpeechSynthesis object keeping the muted track alive, that might cause quite the power drain.

@guest271314

Until a speech synthesis engine is shipped with the browser and provides a means to get a MediaStreamTrack from the speech synthesis engine, the following approach can be utilized at a Native Messaging host or using a WebSocket. At Chromium a WebSocket connection to the local file system allows client code to be saved in Sources => Snippets and run from any origin (e.g., chrome-search://local-ntp) by right-clicking the name of the snippet and then selecting Run. Native Messaging requires the code to be loaded as an extension and run as an "app" at the extension URL.

At Chrome or Chromium open DevTools, select Sources, then select Snippets, click New snippet, then write the code in the center window and give the snippet a name, e.g., "ws-speak-mst"

const connection = new WebSocket("ws://127.0.0.1:8080", "echo-protocol");
connection.onmessage = async message => {
  try {
    // message.data is a data URL
    const response = await (await fetch(message.data)).arrayBuffer();
    const ac = new AudioContext();
    const destination = ac.createMediaStreamDestination();
    const ab = await ac.decodeAudioData(response);
    const source = ac.createBufferSource();
    source.buffer = ab;
    source.connect(destination);
    source.connect(ac.destination);
    // MediaStreamTrack with media source being output from espeak-ng
    const [track] = destination.stream.getAudioTracks();
    // just to verify the track is outputting only the TTS audio
    const recorder = new MediaRecorder(new MediaStream([track]));
    recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
    source.start();
    recorder.start();
    // stop() the track when the source ends
    source.onended = _ => (track.stop(), track.enabled = false, recorder.stop());
  } catch (e) {
    console.error(e);
  }
};
// usage: the -w option writes a WAV file instead of outputting audio to speakers
connection.send("espeak-ng -w speak.wav 'testing media stream track from espeak-ng'");

The local code can be PHP, Python, bash, or another preferred programming language. In general, the same code can be run using a WebSocket, a local server, or Native Messaging. For this example Node.js is used, in pertinent part:

// inside the "request" handler of a WebSocket server
let connection = request.accept("echo-protocol", request.origin);
connection.on("message", message => {
  require("child_process")
  // message.utf8Data: "espeak-ng -w speak.wav 'testing media stream track from espeak-ng'"
  .exec(message.utf8Data, (err, _, stderr) => {
    require("child_process")
    // convert .wav to .ogg with Vorbis codec (playable at Chromium, Firefox)
    // use FFmpeg, speex, etc. to convert WAV to the required codec, container  
    // send a data URL to the browser
    .exec("oggenc speak.wav -o speak.ogg && base64 speak.ogg", (err, stdout, stderr) => {
      connection.send(`data:audio/ogg;base64,${stdout}`);
    })
  })
})

@guest271314

guest271314 commented Oct 12, 2019

@cwilso This should capture only audio output, precisely when speechSynthesis.speak() is executed, without capturing any microphone input.

(async() => {
  const sink = document.createElement("video");
  document.body.appendChild(sink);
  sink.controls = sink.autoplay = true;
  navigator.mediaDevices.ondevicechange = e => console.log(e);
  const devices = await navigator.mediaDevices.enumerateDevices();
  const {
    deviceId
  } = devices.find(({
    kind, label
  }) => kind === "audiooutput");
  console.log(devices);
  let stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      deviceId: {
        exact: deviceId
      }
    }
  });
  sink.srcObject = stream;
  console.log(devices, deviceId);
  const text = [...Array(10).keys()].join(" ");
  const handleVoicesChanged = async e => {
    const voice = speechSynthesis.getVoices().find(({
      name
    }) => name.includes("English"));
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.voice = voice;
    utterance.pitch = 0.33;
    utterance.rate = 0.1;
    const recorder = new MediaRecorder(stream);
    recorder.start();
    speechSynthesis.speak(utterance);
    recorder.ondataavailable = async({
      data
    }) => {
      console.log(URL.createObjectURL(data));
    }
    utterance.onend = e => (recorder.stop(), stream.getAudioTracks()[0].stop());
  }
  speechSynthesis.onvoiceschanged = handleVoicesChanged;
  let voices = speechSynthesis.getVoices();
  if (voices.length) {
    handleVoicesChanged();
    console.log(voices);
  }

})().catch(console.error);

Firefox throws an OverconstrainedError when exact is used. However, Firefox does list

Monitor of Built-in Audio Analog Stereo
