Capture, receive, and RTP timestamp concept definitions & normative requirements for gUM/gDM #156
Conversation
Why isn't the existing videoFrame.timestamp good enough for AV sync, especially when it's defined for both audio and video, and has a higher resolution to boot?
index.html (Outdated):

> Some video sources can supply information about when a video frame was captured.
> This information is useful for example for AV sync and end-to-end delay measurement.
How does it help with AV sync if it's only on video frames?
Absolute audio capture timestamps are available via other web APIs right?
To try to answer the overarching question: presentation timestamps alone are insufficient if you want to do accurate AV sync on a local (think MediaRecorder) or on a remote receiver, especially in the presence of delays in the capture process, or of VTGs inserted into the capture pipeline to do video processing.
Is this answer enough or do you want to see further changes to clarify this in the spec text based on latest uploaded revision?
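To make the argument above concrete, here is a minimal sketch with made-up numbers (not spec text, and `endToEndDelayMs` is a hypothetical helper): an absolute capture timestamp makes end-to-end delay a simple subtraction, while a per-track 0-based presentation timestamp carries no information about when capture actually happened on the wall clock.

```javascript
// All times are DOMHighResTimeStamp-style milliseconds on the same clock.
// When captureTime is absolute (Performance.timeOrigin + Performance.now()
// at the instant of capture), capture-to-now delay is a subtraction:
function endToEndDelayMs(captureTime, nowAbsolute) {
  return nowAbsolute - captureTime;
}

// Hypothetical frame: the 0-based presentation timestamp (33 ms) says
// nothing about absolute capture time; captureTime does.
const frame = { captureTime: 1000000, presentationTimestamp: 33 };
const now = 1000060; // receiver/renderer time on the same clock origin
console.log(endToEndDelayMs(frame.captureTime, now)); // 60
```

The same subtraction works on a remote receiver, provided both ends agree on (or estimate) a common clock origin.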
index.html (Outdated):

> <dt><dfn><code>captureTime</code></dfn> of type <span
> class="idlMemberType">DOMHighResTimeStamp, readonly</span></dt>
> <dd>
>   <p>
>     The capture time of the frame, defined as {{Performance.timeOrigin}} + {{Performance.now()}}.
>     This is the user agent's best estimate of the instant the frame content was captured or
>     generated.
Hi, @jan-ivar - I was a little trigger happy in uploading this PR, I will upload changes soon.
`captureTime` is absolute, which enables delay measurements in a video pipeline. Also, video capture OS APIs typically provide media sampling timestamps, which I've measured to be sometimes tens of milliseconds before the frames are available in a capture sink. I've tried to update what the timestamps are used for in non-normative text.

I don't know why `VideoFrame.timestamp` was specified in micros. DOMHighResTimeStamp, with resolution around 0.1 ms, seems fine.
The presentation timestamp seems a bit under-specified, but I gather it starts at 0 at the start of capture and follows the "media timeline", which I suppose in theory can get out of sync (what happens to it with track.enabled = false?). So I can see a need here. I just think it would be good to write down the differences and why we need a video capture timestamp.
41da924 is the uploaded state prior to serving Youenn's request of extending with timestamp writing algorithms for local sources.
If you're going to cover remote tracks as well as local ones, then some clarifications are needed. But it might be simplest to focus on locally captured tracks.
Co-authored-by: Dominique Hazael-Massieux <[email protected]>
index.html (Outdated):

> </p>
> <p>
> Each video frame has a <dfn class="export">presentation timestamp</dfn>
> which is relative to the first frame appearing on the track. The timestamp
Should we state that the first video frame of a {{MediaStreamTrack}} has a presentation timestamp of 0? Or is it the first frame of the track's source?

Does this mean that the first video frame of a cloned track will have a timestamp of 0, or will cloned tracks have the same video frame timestamps?
> Should we state that the first video frame of a {{MediaStreamTrack}} has a presentation timestamp of 0?

This - updated with clarification that the first frame's timestamp is 0.

> Does this mean that the first video frame of a cloned track will have a timestamp of 0, or will cloned tracks have the same video frame timestamps?

Having it start at 0 for cloned tracks is consistent with the definition as written here. Trying to specify 0 for the first frame from a source could be interpreted as replicating the same timestamp sequence across all tabs - we don't want that for privacy reasons, right?
> The user agent MUST set the [= capture timestamp =] of each video frame that is sourced from
> {{MediaDevices/getUserMedia()}} and {{MediaDevices/getDisplayMedia()}} to its best estimate of the time that
> the frame was captured, which MUST be in the past relative to the time the frame is appearing on the track.
> This value MUST be monotonically increasing.
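As a non-normative illustration of the two MUSTs in the quoted text (this is a sketch, not spec text; `makeCaptureTimestampSanitizer` is a hypothetical helper name), a user agent could clamp its raw capture-time estimate like this:

```javascript
// Returns a function that sanitizes raw capture-time estimates so that the
// emitted capture timestamp is (a) never later than the time the frame
// appears on the track, and (b) monotonically non-decreasing across frames.
function makeCaptureTimestampSanitizer() {
  let last = -Infinity;
  return (estimatedCaptureTime, frameAppearanceTime) => {
    // MUST be in the past relative to the frame appearing on the track:
    let t = Math.min(estimatedCaptureTime, frameAppearanceTime);
    // MUST be monotonically increasing (here: non-decreasing) across frames:
    t = Math.max(t, last);
    last = t;
    return t;
  };
}

const sanitize = makeCaptureTimestampSanitizer();
console.log(sanitize(100, 150)); // 100 (estimate accepted as-is)
console.log(sanitize(90, 160));  // 100 (clamped up: must not go backwards)
console.log(sanitize(200, 170)); // 170 (clamped down: must not be in the future)
```

Whether "monotonically increasing" means strictly increasing (requiring an epsilon bump instead of `Math.max`) is one of the points the review discusses.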
At TPAC, I think there was consensus on having camera tracks' timestamp == capture timestamp. That does not seem to be the case here, though I would guess we keep the idea that the same clock is used.
Sorry, can you elaborate on what you mean here?
As per https://jsfiddle.net/4yzmwnsL/, it seems both Chrome and Safari use the same `frame.timestamp` for cloned tracks, even though the second track was cloned 2 seconds after the first one. This seems to be in contradiction with this PR. Instead, implementations seem to be aligned with the idea that capture timestamp and presentation timestamp are both set by the source, not by the track.

I am also not totally clear on the difference between these two timestamps for local sources. It seems that "the time the frame is appearing on the track" would be the presentation timestamp, and would be slightly greater than the capture timestamp. If that is the case, we should refer to both capture timestamp and presentation timestamp in that section of the PR.
Yeah, that's right, I spoke to this at TPAC. Currently, Chrome emits absolute capture timestamps from gUM/gDM capture to the WebCodecs `timestamp` slot in MSTP. We believed we could do this since the nature of the timestamp isn't really well specified in WebCodecs. It was later found that there are web apps that depend on a 0-based timeline, and we have heuristics to support both cases in MSTG. The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

presentation timestamp is 0-based and increments by frame duration. We intend to change Chrome back to 0-based here after the PR series lands.

capture timestamp is absolute and unobservable on MediaStreamTracks, but there are behavior requirements stated for gDM/gUM in this PR. It first becomes observable in `VideoFrameMetadata` creation at the MSTP.
> I am also not totally clear of the difference between these two timestamps for local sources.

For gUM/gDM sources, assume frame sequence indices 0, 1, 2, ... i ... N. `presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i]`, i.e. same clock rate, different origin.

Suggestions on how to make this clearer?
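To illustrate "same clock rate, different origin" with made-up numbers (a sketch, not spec text): the presentation timeline is the capture timeline re-based so the first frame is 0, so per-frame deltas (the frame durations) are identical on both timelines.

```javascript
// Absolute capture timestamps in milliseconds (hypothetical values).
const captureTimestamps = [1000000, 1000033, 1000066, 1000100];

// Re-base onto a 0-origin presentation timeline: same clock, shifted origin.
const origin = captureTimestamps[0];
const presentationTimestamps = captureTimestamps.map(t => t - origin);
console.log(presentationTimestamps); // [0, 33, 66, 100]

// The per-frame deltas match on both timelines (the frame durations):
for (let i = 1; i < captureTimestamps.length; i++) {
  const dCapture = captureTimestamps[i] - captureTimestamps[i - 1];
  const dPresentation = presentationTimestamps[i] - presentationTimestamps[i - 1];
  if (dCapture !== dPresentation) throw new Error("clocks diverged");
}
```

The constant offset between the two timelines is exactly `origin`, which is what later comments in this thread suggest spelling out in the spec text.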
> The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.

That is fine to me.

> presentation timestamp is 0-based and increments by frame duration.

This PR assumes that 0 is per track, meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.

For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

> It was later found that there are web apps that depend on a 0-based timeline

Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink, or per track's source.

> For gUM/gDM sources, assume frame sequence indices 0, 1, 2, ... i ... N. `presentation timestamp[i + 1] - presentation timestamp[i] = capture timestamp[i + 1] - capture timestamp[i] = VideoFrame.duration[i]`, i.e. same clock rate, different origin.

That makes sense to me.

> Suggestions on how to make this clearer?

I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.
> presentation timestamp is 0-based and increments by frame duration.
>
> This PR assumes that 0 is per track, meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.
>
> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."
>
> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).

I think 0 per track source is what makes most sense. WDYT?
> > The purpose of this PR series is to clean this whole topic up and stop using 1 field for 2 things.
>
> That is fine to me.

> > presentation timestamp is 0-based and increments by frame duration.
>
> This PR assumes that 0 is per track, meaning that track and track.clone() would not have the same definition of 0. This seems unnatural to me. I would tend to go with 0 as the time of the first frame emitted by the source. For VTG, 0 would refer to the time the first video frame is enqueued.
>
> For this PR, that would mean changing "which is relative to the first frame appearing on the track." to "which is relative to the first frame emitted by the track's [[Source]]."

> I think 0 per track source is what makes most sense. WDYT?

Done in a8fc811.

> Or it could be 0 per track's sink, meaning that MediaStreamTrackProcessor would always provide its first frame to the web app with a timestamp of 0 (provided the web app reads the frames quickly enough).
>
> > It was later found that there are web apps that depend on a 0-based timeline
>
> Maybe these web apps (and the heuristics you mentioned) could tell us whether 0 should be per track, per track's sink or per track's source.

The usages we found would work great if MSTP exposes 0-based from creation.

> > Suggestions on how to make this clearer?
>
> I would add some wording stating that presentation timestamp and capture timestamp are using the same clock and have a constant offset.

Done in a8fc811.
Co-authored-by: youennf <[email protected]>
That seems indeed better compared to per track.

That is closer to per track's sink timestamps, but... If we do that change in MSTP, do we need …

I was expressing myself sloppily, I meant what you wrote. So for this PR, we could prepare for that by adding an |offset| parameter to the "Initialize Video Frame Timestamps From Internal MediaStreamTrack Video Frame" algorithm, which is specified later from mediacapture-transform as "the …

In general yes, it's not possible to recreate any accurate …

We could consider that MSTP is receiving a VideoFrame object and use https://w3c.github.io/webcodecs/#videoframe-initialize-frame-from-other-frame to create a new VideoFrame with an updated timestamp.

It seems that …

Based on this, the fact that the … The only requirement could be that presentation timestamps have the same clock as capture timestamps for local sources.

Yes, …

I buy requiring the same clock rate, but I don't see the requirement to require equality for the offset. Given sinks that expose … Wdyt we just add a note explaining this and that it's allowable to have the same clock (rate and offset) for …

Agreed.

Right, this seems editorial.
…deo frame writing algorithm.
@dontcallmedom I can't parse what respec complains about - ideas?

Looks like it was a temporary hiccup; re-running the job fixed it.
LGTM once we say something of the [=presentation timestamp=] of local video capture, either same value as [=capture timestamp=] or with a fixed offset.
Plus the removal of a MUST we cannot really test.
Co-authored-by: youennf <[email protected]>
Only for local tracks, of course.

Thanks for all the activity! Let me have a look. I'd also love for @padenot to look this over.

All good from my perspective, with comments from @youennf addressed. Please remember to put this upstream and not here, thanks.

@padenot thanks - can you just clarify what you meant by "upstream and not here"? We had one interpretation in the editors meeting, but reading again I'm not sure that was right.
…e same for local capture
Just want to share my appreciation for adding capture timestamp. This is exactly what we are missing for having accurate multi-stream recording in tella.tv |
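The multi-stream recording use case mentioned above can be sketched with hypothetical frame data (all names here are illustrative, not from any spec or API): with only per-track 0-based presentation timestamps, the real offset between independently started recordings is lost, while absolute capture timestamps make alignment a single subtraction.

```javascript
// Two tracks captured independently; captureTime is absolute milliseconds
// on a shared clock (hypothetical values).
const camFrames    = [{ captureTime: 5000 }, { captureTime: 5033 }];
const screenFrames = [{ captureTime: 5120 }, { captureTime: 5153 }];

// Milliseconds to delay the camera track so both recordings line up on a
// shared editing/playback timeline:
const offsetMs = screenFrames[0].captureTime - camFrames[0].captureTime;
console.log(offsetMs); // 120
```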
Thanks! I approve of overall direction. Happy to file followups if I find something.
Partly addresses w3c/webcodecs#813 (review).
Defines captureTime in mediacapture-extensions.