
Either fully support or remove audio capture entirely: "MAY" re audio capture is ambiguous #140

Closed
guest271314 opened this issue May 9, 2020 · 18 comments


@guest271314
Contributor

This entire paragraph

In the case of audio, the user agent MAY present the end-user with audio sources to share. Which choices are available to choose from is up to the user agent, and the audio source(s) are not necessarily the same as the video source(s). An audio source may be a particular application, window, browser, the entire system audio or any combination thereof. Unlike mediadevices.getUserMedia() with regards to audio+video, the user agent is allowed not to return audio even if the audio constraint is present. If the user agent knows no audio will be shared for the lifetime of the stream it MUST NOT include an audio track in the resulting stream. The user agent MAY accept a request for audio and video by only returning a video track in the resulting stream, or it MAY accept the request by returning both an audio track and a video track in the resulting stream. The user agent MUST reject audio-only requests.

is ambiguous, capable of producing confusion for both implementers and front-end users; see w3c/mediacapture-screen-share-extensions#12.

For *nix users with PulseAudio installed, the platform supports system-wide audio capture at a device named "Monitor of <device>"; see https://bugs.chromium.org/p/chromium/issues/detail?id=1032815.

Right now, *nix users with PulseAudio installed have the technical capability to capture system-wide audio, yet, for lack of clear and affirmative language in the specification, implementers are neither encouraged nor compelled to make sure audio capture is implemented in conformance with the specification: no mandatory language tells them to do so.

Either change the language to state that the user agent MUST capture audio when the constraint is passed and the architecture and platform support it; or, if audio capture is really not intended for this API, remove all audio language from the specification completely.

@guest271314
Contributor Author

For example, a change to unambiguously support audio capture using getDisplayMedia(), which is possible at *nix through the PulseAudio interface:

"If the platform supports system-wide or specific application audio capture for which permissions have been granted to capture the user agent MUST capture audio output from that device when the {audio: true} constraint is passed"

@guest271314
Contributor Author

If the user agent knows no audio will be shared for the lifetime of the stream

How could the user agent possibly know that, when getDisplayMedia() is executed with {audio: true} and MediaStream has an addTrack() method?
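
As a concrete sketch of why the criterion is unknowable up front: the page can attach an audio track to the returned stream at any later moment; here an OscillatorNode stands in for an arbitrary audio source, and the five-second delay is arbitrary:

navigator.mediaDevices
  .getDisplayMedia({ video: true, audio: true })
  .then(stream => {
    const ctx = new AudioContext();
    const dest = ctx.createMediaStreamDestination();
    const osc = ctx.createOscillator();
    osc.connect(dest);
    osc.start();
    // Long after capture began, audio joins the stream
    setTimeout(() => {
      stream.addTrack(dest.stream.getAudioTracks()[0]);
      console.log(stream.getAudioTracks()); // no longer empty
    }, 5000);
  });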

@guest271314
Contributor Author

Consider playing a video at mpv, with getDisplayMedia({video: true, audio: true}) executed after the video is already playing; video is output to the application window and audio is output to headphones, if plugged in, or speakers. The expected result is for both video (screen) and audio to be captured. The platform, *nix with PulseAudio managing audio devices and output, supports audio capture.

The fact that audio is already being output at the system before getDisplayMedia({video: true, audio: true}) is called nullifies the criterion

If the user agent knows no audio will be shared for the lifetime of the stream it MUST NOT include an audio track in the resulting stream.

which appears to be impossible to determine: how does the user agent "know" anything about the entire lifetime of the stream, particularly when {audio: true} is included as a constraint to the function? Unless the constraint is ignored and inapplicable in any case, as determined by the user agent rather than the technical capabilities of the platform?

Why would audio not be captured in the case of a local media player application playing a video whose audio track is being output? If the {audio: true} constraint is actually processed through an algorithm by the user agent (to comport with the implementation of the specification), that algorithm can verify that the application window being captured, and, as in this case, the system audio routed to headphones or speakers, is actually outputting audio; and because the system supports capturing that device and the physical user granted the user agent permission to do so, the audio MUST be captured.

"Who" is responsible for not capturing audio when the system supports such capture; the specification, the implementation?

@aboba
Contributor

aboba commented May 9, 2020

AFAIK, only system-wide audio capture has been implemented (in Chromium), but audio capture from applications isn't precluded. Since this isn't a widely implemented feature, getting consensus to make it mandatory could be hard. On the other hand, it is used, so getting consensus to remove it is also unlikely. So it's optional, in the absence of further implementation progress.

@guest271314
Contributor Author

@aboba

AFAIK, only system-wide audio capture has been implemented (in Chromium)

Not by default at *nix.

It is not possible to get system-wide audio output at Chromium at *nix without setting Monitor of <device> either at the PulseAudio sound settings GUI or in ~/.asoundrc (untested).

Firefox does list Monitor of <device> at getUserMedia(); see web-platform-tests/wpt#23084, linked from w3c/mediacapture-screen-share-extensions#12.

Consider

let constraints = new Map([
    ['getDisplayMedia', { audio: true, video: true }],
    ['getUserMedia', { audio: true }],
  ]),
  audioTrack,
  videoTrack,
  recorder;
navigator.mediaDevices.ondevicechange = e => console.log(e);
navigator.mediaDevices
  .getDisplayMedia(constraints.get('getDisplayMedia'))
  .then(async stream => {
    try {
      console.log(stream.getTracks());
      [videoTrack] = stream.getVideoTracks();
      // Audio was requested but getDisplayMedia() returned no audio
      // track; fall back to capturing a monitor device via getUserMedia()
      if (
        constraints.get('getDisplayMedia').audio &&
        stream.getAudioTracks().length === 0
      ) {
        const devices = await navigator.mediaDevices.enumerateDevices();
        console.log(devices);
        // Monitor sources are exposed as audio inputs labelled
        // "Monitor of <device>" (Firefox with PulseAudio)
        const audioDevice = devices.find(
          ({ kind, label }) =>
            kind === 'audioinput' && label && label.includes('Monitor')
        );
        if (audioDevice) {
          // MediaTrackConstraints expect { deviceId: { exact: ... } }
          constraints.get('getUserMedia').audio = {
            deviceId: { exact: audioDevice.deviceId },
          };
          const audioStream = await navigator.mediaDevices.getUserMedia(
            constraints.get('getUserMedia')
          );
          [audioTrack] = audioStream.getAudioTracks();
          if (audioTrack) {
            stream.addTrack(audioTrack);
            recorder = new MediaRecorder(stream);
            recorder.ondataavailable = e => {
              console.log(URL.createObjectURL(e.data));
              audioTrack.stop();
              videoTrack.stop();
            };
            recorder.start();
            // Record ten seconds, then stop
            setTimeout(() => {
              recorder.stop();
            }, 10000);
          }
        }
      }
    } catch (e) {
      console.error(e);
    }
  });

Unless the user physically selects Monitor of <device> outside of the browser - specifically, only during recording of the initial MediaStream using MediaRecorder, as a workaround to persist that device as the default whenever getUserMedia() is called thereafter, until changed at the GUI or in native code - the audio track will always be microphone input, not system-wide or application-specific audio output, because after the recording has stopped the user no longer has that option.

The fix here is very simple. Either remove the MAY language, so as not to suggest implementers must conform with this specification regarding audio capture at all, or substitute MUST for MAY to compel implementers to allow selection of Monitor of <device> at the getUserMedia() and getDisplayMedia() prompts (and constraints) at Chromium.

If you are aware of a canonical means to select system-wide audio output at Chromium/Chrome other than the procedure described above, kindly share the complete procedure here. As it stands, after experimenting with and testing various approaches, it appears to be impossible to capture audio output at Chromium using an official API.
(screenshots)

@guest271314
Contributor Author

At *nix, by default, Chromium only provides access to the 'Default' audio device (screenshot); even if a device is labelled 'audiooutput', the track will still be from the microphone.

Right now the specification is not clear at all regarding audio capture; thus, there is no unambiguous language in the specification which a user can point to in an issue filed with an implementer. The implementer, if the issue is not closed outright, could simply point to the MAY language, if at all, and let the issue sit. How can a user cite the current iteration of this specification as authority for audio capture when MAY is used?

@guest271314
Contributor Author

This comment attests to system-wide audio capture being possible at *indows OS by default: w3c/mediacapture-main#694 (comment). *nix has the technical capability to capture system-wide audio output; from PulseAudio/Examples - Arch Wiki, https://wiki.archlinux.org/index.php/PulseAudio/Examples#ALSA_monitor_source:

ALSA monitor source
To be able to record from a monitor source (a.k.a. "What-U-Hear", "Stereo Mix"), use pactl list to find out the name of the source in PulseAudio (e.g. alsa_output.pci-0000_00_1b.0.analog-stereo.monitor). Then add lines like the following to /etc/asound.conf or ~/.asoundrc:

pcm.pulse_monitor {
  type pulse
  device alsa_output.pci-0000_00_1b.0.analog-stereo.monitor
}

ctl.pulse_monitor {
  type pulse
  device alsa_output.pci-0000_00_1b.0.analog-stereo.monitor
}

Now you can select pulse_monitor as a recording source.

Alternatively, you can use pavucontrol to do this: make sure you have set up the display to "All input devices", then select "Monitor of [your sound card]" as the recording source.

however, there is no official algorithm or constraint described in the specification demonstrating the canonical procedure to do so.

Firefox 75 and Nightly 78 list Monitor of <device> at the getUserMedia() prompt; Chromium does not.
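
For reference, a short sketch that lists whatever monitor sources the user agent exposes; at Firefox with PulseAudio these appear as 'audioinput' devices labelled "Monitor of <device>" (note that device labels are only populated once a capture permission has been granted):

navigator.mediaDevices.enumerateDevices().then(devices => {
  devices
    .filter(({ kind, label }) => kind === 'audioinput' && label.includes('Monitor'))
    .forEach(({ label, deviceId }) => console.log(label, deviceId));
});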

@guest271314
Contributor Author

@aboba Browsing Chromium issues in the wild for getDisplayMedia() and audio, see the expected result at this bug: https://bugs.chromium.org/p/chromium/issues/detail?id=1074529

What is the expected result?
A reference to a capture should be returned that includes audio- and video-tracks.

@jan-ivar
Member

The API is not ambiguous: it clearly states that applications may not rely on audio being returned. This means an app cannot force a user to share audio, which was intentional.

The use case we had consensus to include was complementary audio for screen-sharing, at a user's discretion. Audio-only capture was deemed out of scope.

@guest271314
Contributor Author

@jan-ivar

The API is not ambiguous: it clearly states that applications may not rely on audio being returned. This means an app cannot force a user to share audio, which was intentional.

The goal is not to force a user to capture audio.

The goal for disambiguation is to allow a user to capture audio.

The use case we had consensus to include was complementary audio for screen-sharing, at a user's discretion. Audio-only capture was deemed out of scope.

That use case is not possible at Firefox or Chromium without using getUserMedia(). At Chromium that use case is not possible at all using only the browser: the user must first record a MediaStream and set the device to Monitor of <device> during the recording to change the device, because Chromium only supplies a specific class of device to capture, that is, the internal microphone.

The use case that you describe as having consensus is not possible right now. Consider

$ mpv blade_runner.webm

then

navigator.mediaDevices.getDisplayMedia({video: true, audio: true})

at a browser, which satisfies

An audio source may be a particular application, window, browser, the entire system audio or any combination thereof.

and nullifies

If the user agent knows no audio will be shared for the lifetime of the stream it MUST NOT include an audio track in the resulting stream.

However, given the use case consensus agreed upon, if issues are filed at Chromium and Firefox right now, the implementers could simply say "we don't want to capture audio, for no reason", and the specification cannot be cited as a primary source for the requirement to capture both the video and the audio output by mpv or a similar media playback application. That renders the "MAY" effectively a "DON'T HAVE TO IF WE DON'T WANT TO, FOR NO REASON", which is ambiguous, or, essentially, a "WILL NOT", making any mention of audio capture in the specification moot as to having teeth.

@guest271314
Contributor Author

It is not clear how

If the user agent knows no audio will be shared for the lifetime of the stream it MUST NOT include an audio track in the resulting stream.

can possibly be determined.

How can the user agent possibly know if "no audio will be shared for the lifetime of the stream"?

What is the algorithm to determine that initial state and the real-time state of the MediaStream?

If the application being captured is, for example, mpv, where the media comprises both video and audio, then audio is being shared at the inception of the lifetime of the stream.

Yet, given the specification and the implementations at Firefox and Chromium at *nix, it is impossible to capture that shared audio using getDisplayMedia(), again rendering the mention of audio in the specification at least ambiguous and inconsistent, unless perhaps *indows is being used, which gives rise to web compatibility issues: https://bugs.chromium.org/p/chromium/issues/detail?id=1074529#c3 (screenshot).

@guest271314
Contributor Author

Solution: leave it up to the user, not the user agent, to decide whether or not

no audio will be shared for the lifetime of the stream

which cannot rationally be determined by any algorithm run by any user agent where addTrack() is a method of MediaStream: the user can add or remove one or more audio tracks to or from the MediaStream from getDisplayMedia() at any time they choose.

A prototype example is already implemented at *indows: a simple checkbox at the getDisplayMedia() prompt, "Share audio".

Individual browsers' behaviour is not infrequently cited at Media Capture Main as an implementation on which to model, or not to model, other browsers' behaviour. In this case *indows just lets the user decide whether to capture ("Share") audio or not. Then "MAY" is in the hands of the user, not a user agent that cannot "know" anything about the lifetime of a MediaStream.
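
From the page's side, leaving the decision to the user could look like the following sketch, assuming a user agent that lets the user toggle audio sharing mid-capture: MediaStream fires addtrack and removetrack for user-agent-initiated track changes, so the page observes the user's choice instead of predicting it:

navigator.mediaDevices
  .getDisplayMedia({ video: true, audio: true })
  .then(stream => {
    // Fired for user-agent-initiated changes, e.g. the user toggling
    // a "Share audio" checkbox during capture
    stream.onaddtrack = ({ track }) =>
      console.log('user started sharing', track.kind);
    stream.onremovetrack = ({ track }) =>
      console.log('user stopped sharing', track.kind);
  });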

@guest271314
Contributor Author

The term of art "MAY" is precisely ambiguous, capable of more than one interpretation and application, as evidenced by *indows implementation, in coding parlance, "flaky". Not taking one side or the other. The codified rules of statutory construction can be used to determine what the words meant when decreed. There are other similar terms of art all too often used when a legislative body ran out of time, got lazy, or wanted the ability for the other branches to apply the rule ambiguously on purpose unbeknownst to the unintitated, typically whom the rule applies to, not for, in this case "MAY" means to implementer (consumer of the language) disregard capturing audio if you want.

A user at *nix, reading

An audio source may be a particular application, window, browser, the entire system audio or any combination thereof.

takes that literally. The user has the requirement to do just that, capture "a particular application", "the entire system audio or any combination thereof", whether mpv or a PictureInPictureWindow, etc.

There is a "MAY". That means there is potential for implementation, which is not strictly precluded. An ambiguous term. YMMV depending on whom is interpreting that word. Could land on either side, or both, depending on arbiter and an individuals' willingness or lack thereof to accept fuzzy logic outcomes for static input value.

The system user agent (however that term is defined: browser, OS, machine) technically supports capturing any of the listed items.

If issues are filed at the implementers (browsers), they can cite "MAY"; filed here, the response is

The use case we had consensus to include was complementary audio for screen-sharing, at a user's discretion.

Yes, and this issue is filed precisely to exercise that discretion in the affirmative. How is that use case achieved at Firefox and Chromium at Linux?

Or, in the case of Firefox and Chromium at Linux does "MAY" really mean "Not Implemented"?

@guest271314
Contributor Author

guest271314 commented May 19, 2020

@aboba @jan-ivar

Revisited and tested the concepts at https://gist.github.com/guest271314/59406ad47a622d19b26f8a8c1e1bdfd5 several hundred times and now have a working example (prototype) of starting and stopping capture of system audio output ("What U Hear") at Linux, tentatively captureSystemAudio() and stopSystemAudioCapture(). The code is not voluminous, cobbled together from existing questions and answers at large on the web, and was tested by starting a filesystem monitor, then accessing one file to start system capture and another file to stop capture.

At the current working version the captured audio is piped through opusenc to a local file, though ideally we could pipe the stream directly to a WebRTC MediaStream (MediaStreamTrack) for the caller without having to store a local file; untested thus far, as native WebRTC is not installed here yet.
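
For the WebRTC hand-off, a minimal sketch of the browser side only, assuming a hypothetical native helper that captures system audio and exchanges SDP over a local WebSocket at ws://localhost:8080; the signaling endpoint and message shape are assumptions, and only standard RTCPeerConnection calls are used:

const pc = new RTCPeerConnection();
const signal = new WebSocket('ws://localhost:8080'); // hypothetical helper
pc.ontrack = ({ track }) => {
  // System audio arrives as a live MediaStreamTrack; no local file needed
  const audio = new Audio();
  audio.srcObject = new MediaStream([track]);
  audio.play();
};
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signal.send(JSON.stringify({ candidate }));
};
signal.onmessage = async ({ data }) => {
  const { sdp, candidate } = JSON.parse(data);
  if (sdp) {
    await pc.setRemoteDescription({ type: 'offer', sdp });
    await pc.setLocalDescription(await pc.createAnswer());
    signal.send(JSON.stringify({ sdp: pc.localDescription.sdp }));
  } else if (candidate) {
    await pc.addIceCandidate(candidate);
  }
};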

If in fact system audio capture is not intended to be part and parcel of this specification, in spite of the language

An audio source may be a particular application, window, browser, the entire system audio or any combination thereof.

and

Since this isn't a widely implemented feature, getting consensus to make it mandatory could be hard.

what is the best way to proceed to get the process/algorithm specified?

Or, if the prevailing consensus is that application and system audio capture are a matter of implementer discretion and interest - in other words, that is what "MAY" is intended to mean in the specification - should one "abandon all hope" of getting this formally specified in a form implementers are used to (e.g., W3C template, group, etc.), just publish the procedure and code at a GitHub repository, and close this issue?

Thanks, /guest271314/


@jan-ivar
Copy link
Member

jan-ivar commented Oct 8, 2020

Closing based on #140 (comment)

@jan-ivar jan-ivar closed this as completed Oct 8, 2020
@guest271314
Contributor Author

Closing based on #140 (comment)

@jan-ivar Can you provide your precise rationale for closing? Should issues be filed at the implementers instead?

@bradisbell

@jan-ivar I just saw this comment you made a few months back:

Audio-only capture was deemed out of scope.

Can you elaborate on why that was the case, and whether or not audio capture of other applications is still out-of-scope?
