Extending Media Capture and Streams with MediaStreamTrack kind TTS #654

Closed
guest271314 opened this issue Jan 9, 2020 · 4 comments

@guest271314

Extending Media Capture and Streams with MediaStreamTrack kind TTS or: What is the canonical procedure to programmatically create a virtual media device where the source is a local file or piped output from a local application?

The current iteration of the specification includes the language

https://w3c.github.io/mediacapture-main/#dfn-source

source
A source is the "thing" providing the source of a media stream track. The source is the broadcaster of the media itself. A source can be a physical webcam, microphone, local video or audio file from the user's hard drive, network resource, or static image.

and

https://w3c.github.io/mediacapture-main/#extensibility

Extensibility

In pertinent part

The purpose of this section is to provide guidance to creators of such extensions.

and

https://w3c.github.io/mediacapture-main/#defining-a-new-media-type-beyond-the-existing-audio-and-video-types

16.1 Defining a new media type (beyond the existing Audio and Video types)

The list items under the above section are incorporated by reference herein.

Problem

Web Speech API (W3C) is dead.

The model is based on communication with speech-dispatcher to output audio from the sound card, where the user (consumer) has no control over the output.
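
For reference, this is the extent of the current Web Speech API path: the synthesized audio is routed directly to the output device and the page never receives the generated audio as a stream or buffer.

// Current Web Speech API usage: audio goes straight to the sound card;
// the page has no way to obtain the generated audio as a MediaStream.
const utterance = new SpeechSynthesisUtterance('Hello world');
utterance.onend = () => console.log('spoken, but no audio data was exposed to the page');
speechSynthesis.speak(utterance);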

This proposal is simple:

Extend MediaStreamTrack to include a kind TTS where the source is output from a local TTS (Text To Speech; speech synthesis) engine.

The model is also simple: assuming there is a local .txt or .xml document, the input text is read by the TTS application from the local file. The output is a MediaStream containing a single MediaStreamTrack of kind and label TTS.

The source file is read and output as a MediaStreamTrack within a MediaStream after a getUserMedia() prompt.

When the read of the file reaches EOF, the MediaStreamTrack of kind TTS automatically stops, similar to "MediaRecorder: Implements spontaneous stopping".
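
A minimal sketch of how the proposed track might be consumed from a page, assuming a hypothetical "tts" constraint and kind (neither is specified or implemented anywhere; the names only illustrate the shape of the proposal):

// Hypothetical sketch only: a "tts" constraint/kind does not exist in any
// specification or browser; it stands in for whatever extension is defined.
navigator.mediaDevices.getUserMedia({tts: true})
  .then(mediaStream => {
    const [track] = mediaStream.getTracks();
    console.log(track.kind, track.label); // hypothetically "tts", "TTS"
    // The track would end on its own once the TTS engine reaches EOF of the input file.
    track.onended = () => console.log('speech synthesis output complete');
  });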

Such functionality already exists for testing purposes. For example:

# launch
chromium-browser --allow-file-access-from-files --autoplay-policy=no-user-gesture-required --use-fake-device-for-media-stream --use-fake-ui-for-media-stream --use-file-for-fake-audio-capture=$HOME/test.wav%noloop --user-data-dir=$HOME/test 'file:///home/user/testUseFileForFakeAudioCaptureChromium.html'

// use in the main thread; the fake device provided by the flags above
// supplies test.wav as the "microphone" input
navigator.mediaDevices.getUserMedia({audio: true})
  .then(mediaStream => {
    // route the captured (fake-device) audio to the speakers
    const ac = new AudioContext();
    const source = ac.createMediaStreamSource(mediaStream);
    source.connect(ac.destination);
  });

One problem with using that testing approach in production to output the result of TTS is that there is no way to determine EOF without getting the duration of the file before playback of the MediaStream. Since SSML can include <break time="5000ms"/>, analyzing the audio output stream for silence can lead to executing MediaStreamTrack.stop() prematurely in order to end the track.

When called multiple times in succession, even after having stat'ed the file twice to get the duration, no sound is output after two to three calls.

macOS also has an issue where the --use-file-for-fake-audio-capture flag does not work in Chrome.
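
A rough sketch of the duration-based workaround described above, assuming the same WAV file passed to --use-file-for-fake-audio-capture is also reachable by the page via fetch() (for example, served locally over HTTP):

// Workaround sketch: stop the capture after the known duration of the source file,
// rather than analyzing the stream for silence (which SSML breaks can defeat).
async function captureForFileDuration(fileUrl) {
  const ac = new AudioContext();
  // Decode the same file up front only to learn its duration.
  const {duration} = await ac.decodeAudioData(await (await fetch(fileUrl)).arrayBuffer());
  const stream = await navigator.mediaDevices.getUserMedia({audio: true});
  const [track] = stream.getAudioTracks();
  ac.createMediaStreamSource(stream).connect(ac.destination);
  // End the track once the fake device should have played the whole file.
  setTimeout(() => track.stop(), duration * 1000);
  return stream;
}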

We can write the input .txt or .xml file to the local filesystem using the File API or Native File System, therefore input is not an issue.
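
For example, a minimal sketch of writing the input document with the Native File System API (now File System Access); the SSML content and file name are placeholders:

// Sketch: write the SSML/text input that the local TTS engine would read.
// Requires a user gesture; availability varies by browser.
async function writeInput() {
  const handle = await window.showSaveFilePicker({suggestedName: 'input.xml'});
  const writable = await handle.createWritable();
  await writable.write('<speak>Hello <break time="5000ms"/> world</speak>');
  await writable.close();
}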

Why Media Capture and Streams and not Web Speech API?

W3C Web Speech API is dead.

W3C Web Speech API was not initially written to provide such functionality, even though the underlying speech synthesis application installed on the local machine might have such functionality.

Even if Web Speech API does become un-dead and moves to provide MediaStream and MediaStreamTrack as output options, some form of collaboration with, and reliance on, this governing specification will be required. Thus it is reasonable to simply begin from the Media Capture and Streams specification re "Extensibility" and work backwards, or rather, work from both ends towards the middle. Attempting to perform either modification in isolation might prove inadequate.

If there is any objection as to W3C Web Speech API being dead, re the suggestion to deal with speech synthesis in the Web Speech API specification, then that objection must include the reasons why Web Speech API has not implemented the SSML parsing flag when the patch has been available for some time (https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18), and why, instead of actually using the Web Speech API, ChromiumOS authors decided to use wasm and espeak-ng to implement TTS (https://chromium.googlesource.com/chromiumos/third_party/espeak-ng/+/refs/heads/chrome), essentially abandoning Web Speech API usage.

--

An alternative approach to solve the use case is for the specification to compose the formal steps necessary to create a virtual media device that getUserMedia() provides access to as a "microphone" (because the device is virtual we can assign it as a microphone, which should be listed at the getUserMedia() prompt and at enumerateDevices()), e.g., https://stackoverflow.com/a/40783725:

diff --git a/webrtc/modules/audio_device/dummy/file_audio_device.cc b/webrtc/modules/audio_device/dummy/file_audio_device.cc
index 8b3fa5e..2717cda 100644
--- a/webrtc/modules/audio_device/dummy/file_audio_device.cc
+++ b/webrtc/modules/audio_device/dummy/file_audio_device.cc
@@ -35,6 +35,7 @@ FileAudioDevice::FileAudioDevice(const int32_t id,
     _recordingBufferSizeIn10MS(0),
     _recordingFramesIn10MS(0),
     _playoutFramesIn10MS(0),
+    _initialized(false),
     _playing(false),
     _recording(false),
     _lastCallPlayoutMillis(0),
@@ -135,12 +136,13 @@ int32_t FileAudioDevice::InitPlayout() {
       // Update webrtc audio buffer with the selected parameters
       _ptrAudioBuffer->SetPlayoutSampleRate(kPlayoutFixedSampleRate);
       _ptrAudioBuffer->SetPlayoutChannels(kPlayoutNumChannels);
+      _initialized = true;
   }
   return 0;
 }

 bool FileAudioDevice::PlayoutIsInitialized() const {
-  return true;
+  return _initialized;
 }

 int32_t FileAudioDevice::RecordingIsAvailable(bool& available) {
@@ -236,7 +238,7 @@ int32_t FileAudioDevice::StopPlayout() {
 }

 bool FileAudioDevice::Playing() const {
-  return true;
+  return _playing;
 }

 int32_t FileAudioDevice::StartRecording() {
diff --git a/webrtc/modules/audio_device/dummy/file_audio_device.h b/webrtc/modules/audio_device/dummy/file_audio_device.h
index a69b47e..3f3c841 100644
--- a/webrtc/modules/audio_device/dummy/file_audio_device.h
+++ b/webrtc/modules/audio_device/dummy/file_audio_device.h
@@ -185,6 +185,7 @@ class FileAudioDevice : public AudioDeviceGeneric {
   std::unique_ptr<rtc::PlatformThread> _ptrThreadRec;
   std::unique_ptr<rtc::PlatformThread> _ptrThreadPlay;

+  bool _initialized;
   bool _playing;
   bool _recording;
   uint64_t _lastCallPlayoutMillis;

That is, in order to not have to ask this body to specify the same in the official standard, just patch the virtual device into the existing infrastructure.
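
Once such a virtual device is registered with the platform, selecting it from a page needs no new API surface; a minimal sketch, assuming only that the virtual device reports a label containing "TTS" (that label is an assumption, not anything specified):

// Sketch: pick a registered virtual "microphone" by label and capture it.
async function getVirtualTtsStream() {
  // Labels are only populated after permission has been granted at least once.
  await navigator.mediaDevices.getUserMedia({audio: true});
  const devices = await navigator.mediaDevices.enumerateDevices();
  const tts = devices.find(d => d.kind === 'audioinput' && /tts/i.test(d.label));
  if (!tts) throw new Error('virtual TTS device not found');
  return navigator.mediaDevices.getUserMedia({
    audio: {deviceId: {exact: tts.deviceId}}
  });
}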

--

Use cases

For some reason, users appear to feel more comfortable using standardized APIs rather than rolling their own. For those users, a canonical means to patch into the existing formal API, without that functionality being officially written, might provide the assurance they seem to want that the means used are appropriate and should "work". Indeed, some users appear not to be aware that the Web Speech API itself currently does not provide any algorithm to synthesize text to speech; it is hard to say.

Support SpeechSynthesis to a MediaStreamTrack

It would be very helpful to be able to get a stream of the output of SpeechSynthesis.

As explicit use cases, I would like to:

  • position speech synthesis in a virtual world in WebXR (using Web Audio's PannerNode)
  • be able to feed speech synthesis output through a WebRTC connection
  • have speech synthesis output be able to be processed through Web Audio
    (This is a similar/inverse/matching/related feature to "should getSupportedConstraints be static?" #66.)
    Though they are aware that the output is squarely their media - user media - that they should be able to "get".

WICG/speech-api#69 (comment)
I think it would be good to have one, relatively simple API to do TTS. I am additionally suggesting here in this issue that you should be able to get a MediaStream of that output (rather than have it piped to audio output).

and

Use and parse SSML to change voices, pitch, rate

Hi, thanks for this. I wish it were a more widely adopted solution, one that everyone could run in their browser without needing to install something on their system. But I understand it's not possible at the moment, so I'll be searching for a way to communicate with WHATWG.

I'll reopen this to help others to read this issue. Please don't close it.

The latter case should be easily solved by implementing SSML parsing. However, that has not been done, even though the patch to do so exists: https://bugs.chromium.org/p/chromium/issues/detail?id=795371#c18

---
 chrome/browser/speech/tts_linux.cc     |    1 +
 third_party/speech-dispatcher/BUILD.gn |    1 +
 2 files changed, 2 insertions(+)

--- a/chrome/browser/speech/tts_linux.cc
+++ b/chrome/browser/speech/tts_linux.cc
@@ -137,6 +137,7 @@ void TtsPlatformImplLinux::Initialize()
   libspeechd_loader_.spd_set_notification_on(conn_, SPD_CANCEL);
   libspeechd_loader_.spd_set_notification_on(conn_, SPD_PAUSE);
   libspeechd_loader_.spd_set_notification_on(conn_, SPD_RESUME);
+  libspeechd_loader_.spd_set_data_mode(conn_, SPD_DATA_SSML);
 }

 TtsPlatformImplLinux::~TtsPlatformImplLinux() {
--- a/third_party/speech-dispatcher/BUILD.gn
+++ b/third_party/speech-dispatcher/BUILD.gn
@@ -19,6 +19,7 @@ generate_library_loader("speech-dispatch
     "spd_pause",
     "spd_resume",
     "spd_set_notification_on",
+    "spd_set_data_mode",
     "spd_set_voice_rate",
     "spd_set_voice_pitch",
     "spd_list_synthesis_voices",

and the maintainers of speech-dispatcher (speechd) are very helpful.

Tired of waiting for Web Speech API to become un-dead, I wrote an SSML parser from scratch using JavaScript.
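
Since SSML is XML, such a parser can start from DOMParser; a minimal sketch (the handling of break elements is only illustrative, not the actual parser referred to above):

// Sketch: parse an SSML document in the browser and collect its break durations
// so downstream playback/stream logic can account for intentional silence.
const ssml = '<speak>Hello <break time="5000ms"/> world</speak>';
const doc = new DOMParser().parseFromString(ssml, 'application/xml');
const breaks = [...doc.querySelectorAll('break')].map(el => el.getAttribute('time'));
console.log(doc.documentElement.textContent.trim(), breaks); // text content, ["5000ms"]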

So, no,

#629 (comment)
Especially not to work around another web API failing to provide adequate access¹ to audio it generates, to solve a use case that seems reasonable in that spec's domain.

is not applicable anymore. Why would users have any confidence that the Web Speech API is un-dead and will eventually address the issue?

Besides, in order to get output as a MediaStream, this specification would need to be involved in some non-trivial way as a reference.

--

The purpose of this issue is to get clarity on precisely what is needed to

  1. Extend getUserMedia() to list a created virtual device for purposes of speech synthesis output;
  2. If 1. is not going to happen (per Support capturing audio output from sound card #629; Clarify getUserMedia({audio:{deviceId:{exact:<audiooutput_device>}}}) in this specification mandates capability to capture of audio output device - not exclusively microphone input device #650) then kindly clearly write out the canonical steps required to create OS agnostic code to implement the device that getUserMedia() is currently specified to list and have access to, so that we can feed that device the input from file or pipe directly to the MediaStreamTrack, so that users can implement the necessary code properly themselves.

The use cases exist. The technology exists. I am attempting to bridge the gap between an active and well-defined specification and an ostensibly non-active and ill-defined specification, one incapable of being "fixed" properly without rewriting the entire specification (which this user cannot participate in due to the fraudulent 1,000 year ban placed on this user from contributing to WICG/speech-api).

What are the canonical procedures to 1) extend (as defined in this specification) MediaStreamTrack to include a "TTS" kind and label, with speech synthesis engine output as the source (as defined in this specification); and 2) programmatically create a virtual input device that getUserMedia({audio:true}) will recognize, list, and have access to?

@guest271314
Author

Note, I am asking this question more for other users and use cases than for myself: for individuals who are more comfortable using clearly defined specifications and official browser implementations than rolling their own.

Some users appear to actually expect these specifications to meet their needs, and/or do not want to "install" anything; rather, the code is expected to be already implemented in the browser, for whatever their reasons are. Nonetheless, some users still apparently believe that the browser should be able to output what they expect it to, given the state of the art. That is a reasonable expectation, though one I have abandoned for myself, at least until that option is foreclosed, which the closure of the related issues effectively does. Still, I will ask one more time, on this occasion, for the canonical procedure to extend MediaStreamTrack and precisely what getUserMedia() expects by way of a "device" that can be listed within the confines of a "microphone" input device.

@guest271314
Author

Basic model of MediaStreamTrack kind "TTS"

  • Single-use
  • Persistent

Single-use

MediaStreamTrack enabled property is set to true while the source is not exhausted; when the source is exhausted, set enabled to false, muted to true, and readyState to "ended".

Persistent

MediaStreamTrack enabled property is set to true when the output is not silence; otherwise set the enabled property to false and the muted property to true. The track does not end until stop() is executed or the track becomes muted.
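
A sketch of observing the single-use lifecycle from a page with standard MediaStreamTrack events, still assuming the hypothetical "tts" kind used in the sketch near the top of this issue:

// Hypothetical sketch of the single-use model; "tts" is not a real constraint/kind.
navigator.mediaDevices.getUserMedia({tts: true}).then(stream => {
  const [track] = stream.getTracks();
  track.onmute = () => console.log('enabled:', track.enabled);   // false while source is silent/exhausted
  track.onunmute = () => console.log('enabled:', track.enabled); // true while source is producing audio
  // Single-use: the track ends on its own once the input document is exhausted.
  track.onended = () => console.log('readyState:', track.readyState); // "ended"
});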

@alvestrand
Contributor

Production of a MediaStreamTrack from TTS should be an extension spec for the Text-To-Speech API, not a feature of the MediaStreamTrack API.

@guest271314
Author

@alvestrand Can you answer this supplemental question

kindly clearly write out the canonical steps required to create OS agnostic code to implement the device that getUserMedia() is currently specified to list and have access to

where the scope is beyond only a TTS use case?

Since Chromium and Chrome already have source code to create a "fake" device, read a WAV file, and output that audio as a MediaStream accessible by getUserMedia(), can you detail the canonical, language-agnostic pattern necessary to programmatically create a device that getUserMedia() is currently mandated by the specification to list as a device?
