A CaptureController object for getDisplayMedia() #230
I do not see much value in reusing controller objects.
Seems fine given above caveat.
This is UA territory, I am not sure the current spec says anything about this.
I am not a big fan of a timer here. Maybe we should start with something restrictive and relax to a timer approach if we see good use cases for it. Or we could require an explicit decision from the capturer and delay video frame delivery until the decision is made.
This seems fine. That means getDisplayMedia promise resolves first, then focus promise. Some additional thoughts:
|
It is also worth pointing out that not focusing increases the risk of mis-selection by the user, so particular care should be brought there by UA to limit this risk. |
To clarify the tone of this message - I think we're on a similar page overall. I am supportive of the controller approach as a good compromise, I want to use it to reach consensus on Conditional Focus, and I'd like us to make rapid progress over the shape. My comments below are to that effect.
Would this be
Do we really want to inform the capturer that the capturee has gained focus? The capturing application can observe that it has, itself, lost focus. But it can't tell if the user has Maybe your intention here is to convey an error message back to the application when focus() fails? Would this be actionable and useful enough to merit the complexity and privacy-concerns entailed?
On the one hand, I really like how this requires an explicit decision from new code while still retaining the established UA behavior for old code. Well done there! On the other hand, I am concerned about specifying defaults. We have two cases to discuss defaults for:
I think we should start out leaving the defaults here up to the UA, and circle back to this if we manage to get consensus. This is reasonable because:
If I am reading @youennf correctly here, then we're in agreement; see above.
The timer is a necessary evil, to remove the sting from this hypothetical malicious code:
await ...getDisplayMedia({video: ..., controller: controller});
busyWait(/*seconds=*/5);
controller.setFocusBehavior("focus-captured-surface");
Note that it doesn't have to be malicious. If a CPU hiccup or buggy code happens to cause the shared surface to be focused at an arbitrary, unexpected point in the future, it will surprise and annoy the user. It might also produce an unintended interaction with the focused surface, that might have been intended for another page. Speaking of which - we should also require that focus() be rendered no-op if the capturing page is not the focused one by the time focus() is evaluated. I believe my current draft of Conditional Focus does that, but otherwise, we could change it.
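To make the timer argument concrete, here is a minimal sketch of a window of opportunity that closes a fixed time after capture starts. Everything in it is an assumption for illustration: `MockCaptureController`, `markCaptureStarted`, the injected clock and the 1000 ms value are hypothetical and are not the real `CaptureController` or any browser's implementation. The thread leaves the actual timeout to the UA.

```javascript
// Hypothetical model only; not the real CaptureController.
const FOCUS_BEHAVIORS = ["focus-captured-surface", "no-focus-change"];

class MockCaptureController {
  constructor(now, windowMs) {
    this.now = now;           // injected clock, so the example is deterministic
    this.windowMs = windowMs; // assumed timeout; the real value is up to the UA
    this.startedAt = null;
    this.behavior = null;
  }

  // Stands in for the moment getDisplayMedia() resolves.
  markCaptureStarted() {
    this.startedAt = this.now();
  }

  setFocusBehavior(behavior) {
    if (!FOCUS_BEHAVIORS.includes(behavior)) {
      throw new TypeError(`Unknown focus behavior: ${behavior}`);
    }
    const elapsed = this.startedAt === null ? 0 : this.now() - this.startedAt;
    if (elapsed > this.windowMs) {
      throw new Error("InvalidStateError: window of opportunity closed");
    }
    this.behavior = behavior;
  }
}

// Fake clock in milliseconds.
let fakeTime = 0;
const clock = () => fakeTime;

// An application that decides right away is accepted.
const quick = new MockCaptureController(clock, 1000);
quick.markCaptureStarted();
quick.setFocusBehavior("no-focus-change");

// An application that stalls for five seconds is rejected.
const late = new MockCaptureController(clock, 1000);
late.markCaptureStarted();
fakeTime += 5000; // stands in for busyWait(/*seconds=*/5)
let lateError = null;
try {
  late.setFocusBehavior("focus-captured-surface");
} catch (e) {
  lateError = e;
}
```

With the timer in place, the late call has no effect on focus, whether the delay came from malicious code or from a CPU hiccup.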
I agree that there is unclarity here. Chrome has gone with the following, and I think it's a good place to start:
// Previously on the track itself, now on the controller:
undefined focus(CaptureStartFocusBehavior focus_behavior);
enum CaptureStartFocusBehavior {
"focus-captured-surface",
"no-focus-change"
};
I'd just change
I'm OK with that.
There are two risks here:
Chrome's approach to this problem was to reduce the first risk. We did that by introducing a preview for tabs (we'd already had them for windows and screens). With that in place, the second risk has become far less relevant. |
Glad to hear support for the controller model! Totally agree that in absence of the controller a spec note seems fine.
I like it. And synchronous exceptions aren't possible from async API methods, so we don't have to worry about those.
I feel we rabbit-holed in #190 and failed to find anything more restrictive that didn't trade one problem for another. But from some of the comments I worry I've undersold the model shift when a controller is present: The focus decision is untied from and no longer orchestrated by getDisplayMedia resolution. This gets us away from the microsecond timescale and questions of defaults. Instead, focus is explicitly executed by a separate Defaults, settable states, and media output blocking seem a poor fit in this model. If we're to go with a method model then we should lean into its advantages of separation. Otherwise we should probably go with a different model.
In my mind, a captureController controls the capture so captureController.focus() focuses the capture, which seems clear to me, and I like short names, but we can always bikeshed.
I think this would defeat use cases such as screen recording. My working assumption as presented has been that screen-capture isn't solely a tool for video conference presentations, and that the current auto-focusing done by browsers amounts to a layer violation. Any perceived privacy guarantees from existing browser behavior here probably deserve to be challenged. I'm happy to beef up privacy indicator requirements, but we should probably discuss that in #211 to avoid this getting long.
Privacy concerns seem no different to me from today where an app knows that if the resulting The method model does offer us a way to convey failure to focus for whatever reason. Might that be of interest to video conferencing sites? |
I also agree that previews are the solution to mis-selection. |
Yes, both work since video defaults to true. We're missing a WPT test for that though... |
It seems we have two paths here.
AIUI, current proposal is close to 2 but with the constraint that focus() is called only once around the time getDisplayMedia is called/resolved.
My understanding was that it is difficult to predict what VC apps want.
This is not sufficient. An attacker can make two pages looking exactly the same (previews will not help there) and get to a point where the capturer is capturing a surface that it can navigate while continuing to capture. Unfocusing makes this attack a little bit easier (the attacker will no longer need to trick the user into unfocusing the capturee). |
That's the version I'm after, and for that reason, I want an explicit decision -
Vendors of video conferencing software have requested this functionality (this = focus-capturee at an arbitrary time). At the moment, I don't think we can accommodate them, because the risks outweigh the rewards. (Note that user activation is insufficient - tricking the user to double-click at arbitrary coordinates on the capturer then becomes equivalent to tricking them into single-clicking at the respective coordinates of the captured surface.) Focus at an arbitrary moment is out of scope for me at this time.
That's my situation too. There's significant overhead to adding this functionality (this = Promises that resolve when focus-change happens).
A use case has to motivate all of this extra work. I have thought about it and could not find it.
I don't think it's actionable enough to justify the costs. To hedge our bet, if this changes in the future, we can fix our mistake by exposing a
I don't understand why. Your example attack consists of capturing the wrong tab, but one which looks identical to the right one. How is focusing it making it less identical? How does it defang the attack? Focus seems orthogonal here. I propose the following consolidation:
If we can agree on this, I can prepare the PR. |
Can you elaborate on the risks? |
Please see the text beginning with "tricking the user to double-click..." and ending with "...of the captured surface" in this message.
|
Your past message says that tricking a user to click will allow the web page to focus to the capturee page.
Why is not focusing the less risky behaviour? |
I am not sure where the unclarity lies. Two options have been discussed:
I have explained the dangers inherent to 2, and explained that therefore, only 1 is in scope for me. Namely, if we were to implement 2, an attacker could cause the user to click inside the captured page, at a location of the attacker's choice. This risk does NOT exist with 1 to any credible extent. I hope this answers the question?
Because the user keeps seeing the same page they had manually focused before, and with which they had interacted, and which would still be the active page if not for the getDisplayMedia() call.
The user DID see a preview, as previously discussed. If you want to mandate a preview in the spec, we can do that.
It's the existing behavior for all supporting browsers, but I am willing to specify differently if that helps us reach consensus. |
Let's see if we can nail down the controller pattern first. Folks who have spoken up seem in agreement that:
Great! Maybe we can nail down this one next, since it seems unrelated to timing:
This removes implementation defined layer-violating behavior which is bad for tests, web developers, users, and screen recording:
Yes, and allowing user agents to override a page not wanting focus would defeat that, wouldn't it? I think a user agent override here would require a strong end-user benefiting reason. Regarding "mis-selection":
I should have said: I don't think focus is the solution to mis-selection. It's the solution to VC presentation, and comes with its own risks like clickjacking. Mitigating one risk with another seems like an unnecessary bargain and a poor strategy. Do we agree that supporting screen recording by letting users pick a displaySurface that isn't brought to front, is a requirement? If we don't have our requirements down, it's probably premature to discuss API. I do agree with the assessment that "not focusing increases the risk of mis-selection" and that UA's should "limit this risk", but I think user agents need to solve this without focus in #211. |
I definitely agree, and it seems to me that you agree too, @jan-ivar. We have discussed this API at #190 over multiple months and several interim meetings. We have all been discussing shape in-depth for a long time now, so if anyone were to call the entire thing into question at this stage, I'd be very curious to hear what had changed their mind. I see points 1-8 in my message as sufficient for an MVP. Could everyone involved please mention:
|
I agree, but with my chair hat on: we have a process where urgency is for the chairs to manage outside of github so we can focus on receiving input on the topic here. chair hat off. |
@jan-ivar, I'm assuming your last comment pertains only to the lines it quoted. So wrt this part:
What is your answer? |
(Btw, I am not sure what urgency is discussed in this comment. The entire premise of Conditional Focus is that the captured surface may be focused or not focused depending on the application's preference. This appeared to be acceptable to all concerned, as we were already diving deep into the particulars. It is reasonable to ask if that has changed, and if so, why.) |
FWIW I disagree with this characterization, as it associates the OP proposal with features that aren't being proposed (straw man). My comparison to user activation was merely meant to suggest our concern over sites missing a timeout seems overblown, because the task of focusing a window is inherently racy with end users anyway who may be focusing windows themselves if they become impatient. E.g. as a user if my CPU is so busy it takes 2 seconds for a site to request focus, then chances are I'll have clicked somewhere else already which might cause focus to never happen. I question letting such edge cases impact API design.
Would this unspecified behavior be identical to point 1? An advantage of leaving the default focus behavior unchanged perhaps is that if the controller grows more orthogonal features in the future (e.g. send capture actions), people might add the controller to use those features only and be surprised by the inadvertent change in focus behavior. So maybe that part is better? OTOH in the send capture handle case users probably DON'T want to focus, so it's hard to say which is better.
This seems like a rehash of #190 with all its microtask problems I raised there. Do we really want to go back there? A microtask-sensitive promise-based API sounds terrible and wouldn't even work with If we go that way, then a callback like @youennf suggested might work. But even that wouldn't stop JS from using XHR to do networking... |
Unspecified is unspecified, but at least in Chrome's implementation, yes, that's what I intend.
Exactly.
I don't see an alternative. Or did I misunderstand your mention of user-activation? Do you suggest we only use a timer? If so, my problems with that include:
I hope we can agree that
Why?
I'd have to refresh my memory on why we'd ended up backing away from callbacks. But it seems unnecessary. Why is it not simple to specify that |
After filing #231 I fear that focusing after capture has begun adds some risk. This perhaps limits the attractiveness of a timer unless it's quite short. cc @martinthomson
Good, we could still specify the invariant that it needs to be the same behavior.
Yes, that is all I proposed in the OP.
As I mentioned I question whether this is real. If my machine takes more than one second to run a task then it's too late to focus reliably anyway. I'd argue it becomes unsafe to focus then #231.
Not in my OP proposal.
getDisplayMedia returns a promise. See #190 (comment). Microtasks are fickle:
const p = Promise.resolve();
function foo() {
return p;
}
async function bar() {
return p;
}
console.log(foo() === p); // true
console.log(bar() === p); // false
E.g. simple stubs like this would FAIL, which seems unattractive:
async function gDM(constraints) {
return navigator.mediaDevices.getDisplayMedia(constraints);
}
(async () => {
const controller = new CaptureController();
const stream = await gDM({controller});
controller.setFocusBehavior("focus-captured-surface"); // throws
})();
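To illustrate why a stub changes timing, here is a small sketch that runs without any capture API. It only uses standard promises; the names `stub` and `order` are made up for the example. Returning an existing promise from an async function wraps it in a new promise that settles a few microtasks later, so a callback registered through the stub runs after one registered directly on the original promise, even when the stub's callback is registered first.

```javascript
const p = Promise.resolve("stream");
const order = [];

// Same shape as the gDM() stub above: an async function returning a promise.
async function stub() {
  return p;
}

// Register through the stub FIRST...
stub().then(() => order.push("stub"));
// ...and directly on the original promise SECOND.
p.then(() => order.push("direct"));

// Check once all microtasks have run.
const settled = new Promise((resolve) => setTimeout(() => resolve(order), 0));
```

The direct callback still runs first, which is the kind of extra delay that would push a stubbed caller past a microtask-sized window.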
Ok I misunderstood the requirements then. A callback approach avoids this. E.g.:
const controller = new CaptureController();
let stream;
controller.pleaseFocus = () => stream.getVideoTracks()[0].getCaptureHandle().origin != window.origin;
stream = await navigator.mediaDevices.getDisplayMedia({controller}); |
Callbacks are ugly, and tasks and microtasks are equally vulnerable to busy waiting. So I'd prefer e.g.:
function queueTask(f) {
window.addEventListener("message", f, {once: true});
window.postMessage("hi", window.origin);
}
const controller = new CaptureController();
const stream = await navigator.mediaDevices.getDisplayMedia({controller});
controller.focus(); // works
await undefined;
controller.focus(); // works
await new Promise(queueTask);
controller.focus(); // throws
If the next task takes over a second to run then something's seriously broken and focus fails, which seems good. Thoughts? |
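The task-based window can be modelled outside a browser. The sketch below is an assumption-laden stand-in, not the proposed algorithm: `MockController`, `mockGetDisplayMedia` and `tryFocus` are hypothetical names, and the window is closed by a second timer task queued right after the promise is resolved.

```javascript
// Hypothetical stand-in for CaptureController: focus() only works while the
// window of opportunity is open.
class MockController {
  constructor() {
    this.windowOpen = false;
    this.focused = false;
  }
  focus() {
    if (!this.windowOpen) {
      throw new Error("InvalidStateError: window of opportunity closed");
    }
    this.focused = true;
  }
}

// Hypothetical stand-in for getDisplayMedia({controller}).
function mockGetDisplayMedia({ controller }) {
  return new Promise((resolve) => {
    setTimeout(() => {
      // Task A: open the window and resolve the promise.
      controller.windowOpen = true;
      resolve({ id: "mock-stream" });
      // Task B, queued immediately: close the window.
      setTimeout(() => {
        controller.windowOpen = false;
      }, 0);
    }, 0);
  });
}

function tryFocus(controller) {
  try {
    controller.focus();
    return "works";
  } catch (e) {
    return "throws";
  }
}

const outcomes = (async () => {
  const controller = new MockController();
  const results = [];
  await mockGetDisplayMedia({ controller });
  results.push(tryFocus(controller)); // microtask inside task A
  await undefined;
  results.push(tryFocus(controller)); // one more microtask, still task A
  await new Promise((r) => setTimeout(r, 0)); // a task queued after task B
  results.push(tryFocus(controller));
  return results;
})();
```

Extra microtask hops, such as those added by an async stub, stay inside the window in this model; only yielding to a later task falls outside it.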
The use cases I heard so far:
This does not prevent pursuing focus() in parallel for more advanced use cases. |
Youenn, the very first post of the original thread mentions the need for conditionally focusing depending on what application is being captured. Please read the Sample Use Case section here. |
So the urgent use case you have in mind is web app focus decision based on information provided by capture handle identity? |
Correct. This has been presented in the WebRTC WG interim meetings over the last year. Speaking of which - happy anniversary to the original Conditional Focus thread! 🎂 And here is a deck from the WG meetings that explains this. See slides 12-14. The minutes reflect your contributions to the discussion. |
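For reference, the decision in this use case can be reduced to a small pure function. This is a sketch under assumptions: `chooseFocusBehavior` is a hypothetical helper, the capture handle is assumed to be an object with an `origin` field (or null when the captured surface exposes none), and the same-origin rule mirrors the `pleaseFocus` example earlier in the thread rather than any agreed policy.

```javascript
// Hypothetical helper: decide the focus behavior from a capture handle.
// captureHandle is assumed to look like { origin, handle }, or null when the
// captured surface did not expose one.
function chooseFocusBehavior(captureHandle, ownOrigin) {
  if (captureHandle && captureHandle.origin === ownOrigin) {
    // Captured surface belongs to the same application, which can presumably
    // be controlled from the capturer, so keep the capturer focused.
    return "no-focus-change";
  }
  // Unknown or foreign surface: let the user interact with it directly.
  return "focus-captured-surface";
}

const own = "https://vc.example";
const sameApp = chooseFocusBehavior({ origin: own, handle: "deck-1" }, own);
const otherApp = chooseFocusBehavior({ origin: "https://other.example", handle: "" }, own);
const noHandle = chooseFocusBehavior(null, own);
```

The returned string would then be handed to whatever method the controller ends up exposing within the allowed window.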
@jan-ivar, if, as you propose, we specify that both getDisplayMedia() and getDisplayMedia({controller}) must have the same default unspecified behavior in the absence of a call to setFocusBehavior(), then it will effectively mean that the latter will focus by default. This means that for applications that don't call setFocusBehavior(), focus always changes after N milliseconds. Repeat - not just when the CPU is unusually busy, but always. That's not fun for the user, so I'd call this decision mutually-exclusive with the decision to rely solely on a timer.
Proposal
I think we can avoid the shimming issues associated with a microtask if we go with a task instead. (I think @alvestrand and @youennf have both suggested that before.) Precisely formulated, when
Open questions
|
s/microtask/task.
This sounds good enough to me.
Rejecting the promise is probably better.
Controller is not transferrable so far.
It should probably be left to UA as this should be an edge case.
|
We can do the following if you'd like:
Note:
IIRC, when presenting to the WG, or possibly during our editor meetings, @jan-ivar spoke of the controller being transferable. We can start with it as untransferable and iterate, though. That's more than fine with me.
I'd still want focus-change to stop then, but it's not terribly important for me given the short timer, and how the user would not really notice the ordering of track-stopped focus-changed anyway.
Let's go with that then. It would be hard for me to commit to a value now. I'd rather set it experimentally and be free to tweak it if the need arises. Sounds like we're nearly there? If @jan-ivar agrees, I can send the PR. |
Why is it important to fail silently in this case? It seems better to report the error. With regards to setFocusBehavior vs. focus, I think we should have a discussion about whether we think there is a path forward for supporting focus() at an arbitrary time (as long as capture happens). |
Because we want
Focus at arbitrary time - not at the moment, and I suspect not at all from Chrome's perspective. But suppose a path is found later - we can use it to deprecate
|
In the CropTarget case, the value returned by the function is the CropTarget itself. The question you might raise is whether there is enough value for exposing this information or not. |
I don't think we should report that. Please see earlier comments on this thread.
There are no immediate benefits to justify the complexity. If benefits are later discovered, changing from |
Great to see progress! Quick comments:
The last point is not for future proofing, but because it is simple and elegant. Browsers' behavior today assumes a specific task. It is taking automatic action toward that task. I think if you attach a controller then it's like turning off autopilot and taking over control, and the assumption flips. You are more likely to remote control the capture in some way, so not losing focus seems like a logical starting point. So far, the features planned for this controller (sendActions, maybe move identity?) fit this assumption. |
Regarding timer, with |
Do we have TPAC time for this? If so it is probably more urgent to make slides than a PR. We could present both options even. |
I'm happy to start with that. (I imagine you might wish to revise that in the future for Capture Actions. I think we'll be able to.)
Awesome, agreement there too.
Please see my point in the first paragraph of this message. Namely, your requirement of
Yes, I intend to use the time set for me to present Conditional Focus. Naturally this will touch on the controller. Do you want to pull our slots together or reorder them so that you may present your wider vision for the controller closer to the presentation of its first use of Conditional Focus? |
This appears to be the main remaining open issue. Let's tackle it. I'll explain my position again.
|
Summary of conclusions from the TPAC session of 2022-09-13 (slides):
|
Regarding window of opportunity, I think we should put a note that the goal is to allow for the current task only and that the current algorithm is an approximation. |
Works for me. |
This CL adds a new CaptureController object that can be passed to navigator.mediaDevices.getDisplayMedia() so that web developers can control whether the captured surface (tab or window only) will be focused or not. Spec: w3c/mediacapture-screen-share#230 Demo: https://capture-controller.glitch.me/ Bug: 1215480 Change-Id: Ib826b702fc853542a0c5bae5ddac3286216e2d63
Thanks for the PR! |
As promised in https://www.w3.org/2022/05/17-webrtc-minutes.html#t05 a dedicated controller object for getDisplayMedia.
An initial version would be aimed at solving #190 for now:
The controller is associated 1 ←→ 1 upon getDisplayMedia success, and cannot be re-associated after that.
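A minimal sketch of the one-to-one association rule, assuming the check happens when a capture succeeds. The names `associateOnSuccess` and `associatedControllers` are hypothetical, and the error type is an assumption since the text above does not name one.

```javascript
// Hypothetical bookkeeping; the real association would live in the user agent.
const associatedControllers = new WeakSet();

class MockCaptureController {}

// Stands in for the step getDisplayMedia() would run on success.
function associateOnSuccess(controller, stream) {
  if (associatedControllers.has(controller)) {
    // Assumed error; the text above only says re-association is not allowed.
    throw new Error("InvalidStateError: controller is already associated");
  }
  associatedControllers.add(controller);
  return stream;
}

const controller = new MockCaptureController();
const first = associateOnSuccess(controller, { id: "first-capture" });

// Reusing the same controller for a second capture is refused.
let reuseError = null;
try {
  associateOnSuccess(controller, { id: "second-capture" });
} catch (e) {
  reuseError = e;
}
```

A fresh controller per getDisplayMedia() call avoids the error, which matches the view in the thread that reusing controller objects has little value.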