
getViewerPose()'s use of XRSpace doesn't quite specify how views work #565

Closed
Manishearth opened this issue Mar 18, 2019 · 17 comments · Fixed by #614
Labels: fixed by pending PR (A PR that is in review will resolve this issue.)

@Manishearth
Contributor

Fundamentally, the XRSpace is a position and an orientation, one that may change every frame.

getViewerPose() gets pose and view information for a viewer given an XR space.

However, the XR space has no inherent concept of a "view". When in immersive mode, aside from eye-level spaces, it doesn't quite make sense to request view data for the space since there's nothing telling you where the eyes are.

For example, what should getViewerPose(identity) return in immersive mode? Should it return just a single view? Should it return two views from the identity matrix, with the eyes offset based on device data? Which direction are the eyes offset in? Does this direction change as the viewer moves their head?

@Manishearth
Contributor Author

Similarly, when there are multiple views, what exactly is originOffset affecting? Can I apply it by simply premultiplying the offset to the view matrices, or is there some deconstruction of pose data that I must first do?

It may be worth explicitly modeling this as a mathematical thing based off of pose information obtained from the device.

@Manishearth
Contributor Author

Okay, so from reading the chromium source (which is in part written by @toji so I assume it follows the intent of the spec 😄 ) it seems like:

  • Fundamentally, a view is a set of offsets from the space. This may or may not involve orientation changes, but does not currently do so in the chromium impl, presumably because XR devices don't yet have/report offset orientation for eyes, which isn't too important anyway since the devices are for humans, not deer.
  • getViewerPose returns views obtained by applying the offset directly to the xrspace's pose (with originOffset applied). Essentially, it is "what if the current XR device were mounted at this XRSpace's pose".
  • eye-level is basically the default pose that most devices report. floor-level is the same pose with a "floor level transform" applied. "position-disabled" is the same pose with the position information zeroed out. getViewerPose() on these returns a set of view matrices that incorporate the pose information suitable for displaying.
  • getViewerPose(identity) basically will return a fixed view matrix, "what if the device were mounted at the origin"

I think the missing piece was understanding that getViewerPose() is "mounting" the XR device on whatever space you give to it, so different spaces don't have different poses.

I think we should explicitly mention this in the spec, by noting:

  • the session keeps track of the offset of the different views
  • each XRSpace has an inherent rigid transform (separate from originOffset), individual space types define how to obtain this for a given frame. It may be worth making input XR spaces their own type, as well as perhaps the identity reference XR space.
  • getViewerPose(space) computes space.transform() * space.offset * viewoffsets for each view (I may have gotten the multiplication order wrong; a rough sketch follows below).
  • probably explicitly mention all the matrix math

I'd love to take a crack at writing spec text if y'all think this would be useful. I'll probably wait until I finish writing the code for servo (and figuring out the precise matrix math involved)
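
For concreteness, here's a rough sketch of the composition in the list above, taking the ordering at face value (as noted, it may well be wrong; later comments in this thread pin the math down more precisely). It uses gl-matrix for the 4x4 math, and all names are illustrative, not spec text:

   import { mat4 } from 'gl-matrix';

   // spaceTransform: the space's inherent rigid transform for this frame (4x4)
   // spaceOffset:    its originOffset (4x4)
   // viewOffsets:    the per-view offsets the session keeps track of (array of 4x4)
   function proposedViewTransforms(spaceTransform, spaceOffset, viewOffsets) {
      const base = mat4.multiply(mat4.create(), spaceTransform, spaceOffset);
      return viewOffsets.map(v => mat4.multiply(mat4.create(), base, v));
   }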

bors-servo pushed a commit to servo/servo that referenced this issue Mar 21, 2019
Some XRSpace improvements

Proper XRSpace support is blocked on immersive-web/webxr#565 , but in the meantime this improves XRSpace support a little bit, preparing both for support in getViewerPose and getPose as well as handling input spaces eventually.

r? @jdm

@klausw
Contributor

klausw commented Mar 22, 2019

@Manishearth , thanks for opening these discussions - I also think that this is somewhat underspecified in the current spec, especially since two people reading the spec may each end up with their own plausible interpretation which ends up being incompatible.

As per my comments on issue #567, I think it helps to clearly describe the poses and transforms as changing coordinates from one space to another space. Here's how I understand it, please let me know if I got it wrong.

Let's say you're on a 6DoF headset and the low-level VR API returns coordinates in tracking space, where the floor center of your play area is (0, 0, 0), and taking a step backwards puts your headset somewhere around (0, 1.6, 1). If you've requested a stationary/standing or bounded reference space, the user agent can treat tracking space and world space as equivalent (ignoring originOffset for now), and getViewerPose would return a pose with a position of (0, 1.6, 1), representing a transform from headset-relative coordinates to world space coordinates.

The per-eye poses would have a small offset from that, i.e. (-0.03, 1.6, 1) for the left eye, and possibly also a small rotation. (The Pimax 5k+ headset has angled screens which need to be represented by such a rotation; this doesn't currently work right in Chrome.) These are basically transforms from eye space to world space. Feeding the eye space origin point (0, 0, 0) into the eye's rigidTransform matrix gives you the eye position in world space.

If you have a 3DoF headset, the low-level VR API would internally return coordinates near (0, 0, 0) with a neck model applied. If you request eye-level reference space, the user agent would return those as-is. If you request stationary/standing reference space, the user agent should still return headset poses around (0, 1.6, 0), so it adds a tracking space to standing space transform that applies an assumed floor offset:

   standing_from_3DoFtracking = (
      1 0 0 0
      0 1 0 1.6
      0 0 1 0
      0 0 0 1
   )
   point_standing = standing_from_3DoFtracking * point_tracking

Conversely, if you have a 6DoF headset and request eye-level poses, the low-level VR APIs would still track 6DoF position, but there'd be a center point inside the tracked area that's treated as the eye-level origin. For example, in SteamVR you mark this point by using the "reset seated origin" function.

Result would be something like this:

   eyelevel_from_6DoFtracking = (
      1 0 0 0
      0 1 0 -1.6
      0 0 1 0
      0 0 0 1
   )
   point_eyelevel = eyelevel_from_6DoFtracking * point_6DoFtracking

(This is just an example. If the low-level API natively supports an eye-level tracking equivalent, the implementation should use that directly since the native API can apply additional logic useful for seated experiences, for example hiding the SteamVR chaperone while your head is near the seated origin.)

originOffset is a transform from world space to tracking space (see issue #567), so you'd get something like this when combining it on a 3DoF headset:

   point_world = inverse(originOffset) * standing_from_3DoFtracking * point_3DoFtracking

If you want a view matrix, you want a transform that goes from world space to eye space, so you'd apply the inverse of the eye pose's rigidTransform. Specifically, this view matrix would transform a world point at the eye position to (0, 0, 0).
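
To make that chain concrete, here's a sketch using gl-matrix (variable names are illustrative, not spec text): standingFromTracking is the floor-offset transform from the example above, originOffsetMatrix is originOffset as a 4x4, and eyeTransformMatrix is one view's rigid transform matrix.

   import { mat4, vec3 } from 'gl-matrix';

   function worldPointAndViewMatrix(pointTracking, originOffsetMatrix,
                                    standingFromTracking, eyeTransformMatrix) {
      // world_from_tracking = inverse(originOffset) * standing_from_3DoFtracking
      const worldFromTracking = mat4.invert(mat4.create(), originOffsetMatrix);
      mat4.multiply(worldFromTracking, worldFromTracking, standingFromTracking);

      // point_world = world_from_tracking * point_tracking
      const pointWorld = vec3.transformMat4(vec3.create(), pointTracking, worldFromTracking);

      // View matrix: world space -> eye space, i.e. the inverse of the eye pose's
      // rigid transform (WebXR also exposes this directly as view.transform.inverse.matrix).
      const viewMatrix = mat4.invert(mat4.create(), eyeTransformMatrix);

      return { pointWorld, viewMatrix };
   }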

@thetuvix
Contributor

The mental model I find most productive for viewers, input sources, reference spaces and all other XRSpaces is to think of them all as peer entities, each positioned and oriented somewhere out there in the physical world. Some entities like viewers and input sources are dynamic, with their origins moving around each frame. Some entities like reference spaces are generally static, with their origins remaining logically fixed in the physical world after creation (with specific tweaks per the rules defined for the various reference spaces).

For a given WebXR session, there is only one viewer in the physical world - if it's a stereo viewer, that viewer is composed of two views, each positioned and oriented themselves within the physical world. When you call getViewerPose and pass in a reference space, you get transform ("center eye"), views[0].transform (left eye) and views[1].transform (right eye), each expressed relative to the origin of the physically-fixed reference space you asked about. If you happened to ask about a different reference space, you'd get the same 3 poses, expressed within that other reference space instead.
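
In API terms that looks something like the sketch below (not spec text; the session and the two reference spaces are assumed to have been requested earlier):

   function onXRFrame(time, frame) {
      // Same physical viewer, expressed in two different reference spaces.
      const poseA = frame.getViewerPose(referenceSpaceA);
      const poseB = frame.getViewerPose(referenceSpaceB);

      if (poseA && poseB) {
         // "Center eye" transform plus one transform per view, each relative
         // to the reference space that was passed in.
         console.log(poseA.transform.position, poseA.views.length);
         console.log(poseB.transform.position, poseB.views.length);
      }
      frame.session.requestAnimationFrame(onXRFrame);
   }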

I think the missing piece was understanding that getViewerPose() is "mounting" the XR device on whatever space you give to it, so different spaces don't have different poses.

I'm not sure I follow the analogy of "mounting" the XR device on the space here. To me, that implies that the device is rigidly locked to the space in question.

The stationary eye-level space is not a space that moves with the user each frame. Instead, it is a coordinate space that, once established, stays fixed in the physical world, with its origin at the location of the user's eyes when the space was first created. When you call getViewerPoses each frame, you are asking the question "what is the position and orientation of the viewer and its 1-2 views this frame, relative to the fixed origin of this reference space"? If a device only has 3DoF tracking capabilities, the position relative to that origin can only incorporate a neck model rather than true 6DoF positional tracking, and so emulatedPosition will be true. However, logically, the app operates the same either way, asking how far the head has moved (in position and orientation) relative to their reference space's physically-fixed origin, and positioning their camera object within their scene at that location, with the child left-eye and right-eye view matrices determined by the two per-view transforms.

@thetuvix
Contributor

This mental model also provides a natural definition for originOffset. For any reference space, there would be a natural origin of that space if originOffset was left as identity. For example, the bounded reference space will have its natural origin located physically on the floor in the center of the bounds rectangle, facing forward along that rectangle's -Z axis.

Then:

  • If originOffset is set to identity, the created reference space's origin will line up exactly with that natural origin.
  • If originOffset is set to a non-identity value, the created reference space's "origin" will be "offset" from that natural origin by the specified position and orientation, using the origin's original axes to interpret the offset.

Either way, the resulting XRReferenceSpace ends up with an origin located at a fixed location in the physical world. When you later call getViewerPose and ask for the viewer pose relative to that bounded reference space, you are asking for this frame's viewer pose relative to that fixed physical location.
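
For example (a sketch, assuming the settable originOffset attribute as discussed in this thread; the XRRigidTransform constructor takes a position and an orientation quaternion):

   // Offset the created reference space's origin 1 meter along -Z from the
   // natural origin and rotate it 180° about Y, with the offset interpreted
   // in the natural origin's axes. The resulting space is still fixed in the
   // physical world; only where its origin and axes sit has changed.
   refSpace.originOffset = new XRRigidTransform(
      { x: 0, y: 0, z: -1 },
      { x: 0, y: 1, z: 0, w: 0 }   // quaternion for a 180° rotation around Y
   );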

Agreed that we should be sure all of this is specified very exactly! However, we should take pains to avoid any mention of "tracking space origin" or other such implementation details in the spec that presume something about the manner in which the underlying XR platform is built.

For example, some native XR APIs have a single eye-level tracking space origin, with all poses expressed relative to that origin. However, other systems like Windows Mixed Reality and HoloLens allow users to walk around large unbounded areas - there is no canonical tracking origin there in which all coordinates are expressed. Instead, users make bounded or unbounded reference spaces as needed, and then ask for head poses, hand poses, etc. relative to one of those spaces.

Each well-known reference space type in WebXR is defined by how its origin is positioned and oriented relative to the physical world when created and how it adjusts over time. The design goal has been to choose definitions that can behave in an observably consistent manner for app developers across different UAs that span disparate tracking technologies. It is up to the UA to manifest the defined contract of that reference space using whatever APIs are exposed by its underlying XR platform.

@klausw
Contributor

klausw commented Mar 22, 2019

@thetuvix I agree that the spec shouldn't refer to internal details of the browser implementation or underlying low-level VR APIs. My previous comment was from an implementor's point of view since @Manishearth was asking about how the different types of reference spaces relate to each other in terms of transforms, and that is an implementation detail that's not directly exposed to users of the WebXR API.

I think we're in agreement that a "space" basically consists of an origin and unit axis vectors that correspond to locations in the real world, and different spaces can be related to each other with XRRigidTransforms.

However, we do need some additional terminology to explain how things work. Currently, an XRReferenceSpace is actually two distinct spaces. There's the "without originOffset" space that you'd get with an identity originOffset, and the "with originOffset" space that is in effect when using the reference space with an arbitrary originOffset. For practical purposes the latter is the "real" reference space since that's what's effective when using the reference space in API calls, but it's really difficult to explain how originOffset works if there's no name for the underlying "without originOffset" space.

I've been roughly following @toji's terminology from his drawings in issue #477, calling the "without originOffset" reference space "tracking space", and the transformed reference space "virtual world space". In this sense "tracking space" is an actual concept that's part of the WebXR API and not just an implementation detail, but I'm very much open for suggestions for alternate terminology.

Here's a proposal for the spec to make XRRigidTransform a bit more precise - would people agree with something like this?

An XRRigidTransform expresses the relationship between two spaces. The XRRigidTransform A_from_B consists of a position and rotation, where the position is the location of space B's origin in space A's coordinate system, and the rotation determines the orientation of space B's axes around that origin. In matrix terms, the rotation part of the matrix contains the B space's axis unit vectors expressed in space A coordinates, and the translation part contains the position of space B's origin in space A coordinates. This matrix converts coordinates from space B to space A:

# (BOrigin_x, BOrigin_y, BOrigin_z) is the position of space B's origin
# expressed in space A's coordinate system.

A_from_B = new XRRigidTransform(
  position: { x: BOrigin_x, y: BOrigin_y, z: BOrigin_z },
  rotation: A_from_B_quaternion
)

# (BXaxis_x, BXaxis_y, BXaxis_z) is a unit vector representing
# space B's X axis expressed in space A's coordinate system.

A_from_B.matrix = (
   BXaxis_x BYaxis_x BZaxis_x BOrigin_x
   BXaxis_y BYaxis_y BZaxis_y BOrigin_y
   BXaxis_z BYaxis_z BZaxis_z BOrigin_z
   0        0        0        1
)

space_A_coordinates = A_from_B.matrix * space_B_coordinates

Using this terminology, originOffset is the XRRigidTransform unoffsettedReferenceSpace_from_effectiveReferenceSpace, where "unoffsetted reference space" is what I was previously calling "tracking space", and "effective reference space" is the virtual world space.

The viewer pose's transform is the XRRigidTransform referenceSpace_from_viewerSpace, where viewer space's origin is the headset position (for example the midpoint between the user's eyes), and the viewer space's -Z axis points forward in viewing direction.

An eye view pose's transform is the XRRigidTransform referenceSpace_from_eyeSpace, where eye space's origin is the eye's location. The transform's position component is the coordinates of the eye location in reference space.
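
To tie the matrix layout above to actual values: XRRigidTransform.matrix is exposed as a 16-element Float32Array in column-major order, so applying A_from_B to a point in space B looks like this (a sketch with a hypothetical helper, not spec text):

   function transformPoint(aFromB /* XRRigidTransform */, p /* {x, y, z} in space B */) {
      const m = aFromB.matrix;  // column-major; the translation lives in m[12..14]
      return {
         x: m[0] * p.x + m[4] * p.y + m[8]  * p.z + m[12],
         y: m[1] * p.x + m[5] * p.y + m[9]  * p.z + m[13],
         z: m[2] * p.x + m[6] * p.y + m[10] * p.z + m[14],
      };
   }

   // Feeding space B's origin through A_from_B recovers A_from_B.position:
   // transformPoint(aFromB, { x: 0, y: 0, z: 0 }) ≈ aFromB.position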

@klausw
Contributor

klausw commented Mar 22, 2019

Sigh, this is confusing, I got originOffset backwards while trying to explain it. I've edited it, here's the corrected paragraph.

Using this terminology, originOffset is the XRRigidTransform unoffsettedReferenceSpace_from_effectiveReferenceSpace, where "unoffsetted reference space" is what I was previously calling "tracking space", and "effective reference space" is the virtual world space.

@Manishearth
Contributor Author

Manishearth commented Mar 28, 2019

@thetuvix

I'm not sure I follow the analogy of "mounting" the XR device on the space here. To me, that implies that the device is rigidly locked to the space in question.

The thing I'm grappling with here is "what happens when you do getViewerPose() on a space that isn't related to the headset and doesn't have eyes" -- i.e. when you use the space of an input device. The "mounting" was to say "pretend the device is sitting on the pose of the given space", but that's actually not the case given other things you've said:

The stationary eye-level space is not a space that moves with the user each frame. Instead, it is a coordinate space that, once established, stays fixed in the physical world, with its origin at the location of the user's eyes when the space was first created.

So initially I thought this was true, however then:

  • position-disabled doesn't quite make sense as a reference space, since it also affects the viewer.
  • the nature of identity reference spaces gets confusing, given this explanation I would imagine that identity reference spaces behave identically to stationary eye-level reference spaces aside from having an offset, but this doesn't match the chromium impl: getViewerPose() on identity returns an identity pose with eye offsets and never changes, getViewerPose() on eye-level returns the relative pose of the viewer, also with eye offsets, and changes every frame.

Overall it seems like reference spaces are not just coordinate systems, but rather have additional magic on how they affect the viewer when used in getViewerPose. One potential way of looking at this is "the identity reference space tracks the viewer" and "position-disabled follows the user around without changing orientation"; however, this affects how getPose works too, so that's not quite accurate. So you basically need two descriptions of the reference space depending on whether it's being used in getViewerPose or getPose. In short, getPose(viewerSpace, referenceSpace) and getViewerPose(referenceSpace) are not the same thing.

It seems like this mental model is incomplete given the abilities reference spaces have (and also the term "reference space" may be misleading here). They seem to have the ability to affect getViewerPose() in ways that do not derive from them being simple reference spaces (i.e. coordinate transforms).

I think some concrete questions that might clear things up are:

  • What happens to the results of getViewerPose() as the user moves around when called with: eye-level, position-disabled, and identity? (i.e. I'm interested in what parts of the pose change, if at all)
  • What happens to the results of getPose() as the user moves around when called with each pair of: eye-level, position-disabled, and identity?

@klausw
Contributor

klausw commented Apr 2, 2019

(Apologies in advance in case I made mistakes in this answer; it's easy to get signs or transform directions wrong. It's intended to match my proposed clarifications along these lines in https://github.com/immersive-web/webxr/pull/569/files - change "diff settings" to "split" to ensure the long lines from the index.bs file don't get truncated.)

@Manishearth wrote:

The thing I'm grappling with here is "what happens when you do getViewerPose() on a space that isn't related to the headset and doesn't have eyes" -- i.e. when you use the space of an input device. The "mounting" was to say "pretend the device is sitting on the pose of the given space", but that's actually not the case given other things you've said:

I think "the device is sitting on the pose of the given space" is a confusing way to put things. A pose is essentially a transform between two spaces. A pose of an object corresponds to a transform from its object-attached XRSpace to an reference space or other destination space, and provides a way to get coordinate values in the destination space. getViewerPose(refSpace) is equivalent to getPose(source=viewerSpace, destination=refSpace), except that getPose is more general since the destination can be any XRSpace, not just an XRReferenceSpace.

So you can do getPose(source=viewerSpace, destination=inputGripSpace), and the resulting pose is the transform from viewerSpace to inputGripSpace, and the pose transform's matrix is inputGripSpaceFromViewerSpace. If you have coordinates in viewer space, you can left-multiply them with that matrix and get coordinates in inputGripSpace. For example, if you plug in the viewer space origin, and you're looking at the back of your right hand holding a controller, you'd get coordinates in that controller's space where +X points towards the back of the hand, so your resulting coordinates would be something like (+20, 0, 0), and this would also be the position component of the transform of the pose you got from getPose in this case. (The position component is the coordinates of the source space's origin in the transform destination space's coordinate system.) If you move the hand 5cm further away, the pose position would change to something like (+25, 0, 0).
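
In code form (a sketch; viewerSpace and gripSpace are assumed to have been obtained earlier, e.g. the session's viewer space and an input source's gripSpace):

   // XRFrame.getPose(space, baseSpace): the returned pose describes `space`
   // relative to `baseSpace`, i.e. its transform is baseSpace_from_space.
   const pose = frame.getPose(viewerSpace, gripSpace);
   if (pose) {
      // Position of the viewer's origin expressed in grip-space coordinates,
      // e.g. roughly off the +X side of the hand in the scenario above.
      const { x, y, z } = pose.transform.position;
   }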

position-disabled doesn't quite make sense as a reference space, since it also affects the viewer.

It cannot affect the viewer, no amount of math you do in the implementation will move the viewer around unless you have a haptic suit or motion platform that can physically grab them and move them around ;-)

Instead, what's happening is that the origin of the position-disabled reference space follows your head position (but not orientation) as you move around. Imagine the origin of that space stuck to the bridge of your nose, but its XYZ axes keep pointing in the same world directions (i.e. Z=north, Y=up) even when you move your head. So the viewer pose in relation to a position-disabled ref space has just the rotation component; the position is zero. If you get a controller pose in that space, the position would be relative to that point between your eyes.

the nature of identity reference spaces gets confusing, given this explanation I would imagine that identity reference spaces behave identically to stationary eye-level reference spaces aside from having an offset, but this doesn't match the chromium impl: getViewerPose() on identity returns an identity pose with eye offsets and never changes, getViewerPose() on eye-level returns the relative pose of the viewer, also with eye offsets, and changes every frame.

No, the identity reference space when used for querying viewer poses is effectively a coordinate system glued to the bridge of your nose that moves along with your head, so that +X always points towards your right eye, assuming you haven't changed originOffset. The transform from that to your viewer space is always an identity matrix, hence the name. If you use originOffset, its inverse gets applied to poses you query from it. You'll never see any actually measured headset movement or rotations in those poses, only what you set yourself via originOffset. Using it with controllers would be weird even if mathematically possible - the intended use is for inline nonimmersive sessions where you wouldn't get controller poses.

Overall it seems like reference spaces are not just coordinate systems, but rather have additional magic on how they affect the viewer when used in getViewerPose. One potential of looking at this is "the identity reference space tracks the viewer" and "position-disabled follows the user around without changing orientation", however this affects how getPose works too, so that's not quite accurate. So you basically need two descriptions of the reference space depending on whether it's being used in getViewerPose or getPose. In short, getPose(viewerSpace, referenceSpace) and getViewerPose(referenceSpace) are not the same thing.

I think you had it right initially, there's no extra magic. getPose(viewerSpace, referenceSpace) and getViewerPose(referenceSpace) are the same thing, and your descriptions of identity and position-disabled match what I had described above. Where do you see a mismatch?

I think some concrete questions that might clear things up are:

What happens to the results of getViewerPose() as the user moves around when called with: eye-level, position-disabled, and identity? (I.e. i'm interested in what parts of the pose change, if at all)

eye-level is a fixed origin at a given point in space, i.e. your head's resting position when sitting in your gaming chair. If you lean 30cm to the left and have a 6DoF headset, you'll get a position of (-30, 0, 0) or so in eye-level reference space, and exactly (0, 0, 0) for position-disabled and identity. If you also tilt your head, you'll get a corresponding orientation for eye-level and position-disabled poses, but the identity reference space pose will not change at all in response to head movements.

If you have a 3DoF headset, the eye-level pose would have a smaller position change in response to head movement based on a neck model (it can't detect leaning), while position-disabled and identity would behave the same as a 6DoF headset.

What happens to the results of getPose() as the user moves around when called with each pair of: eye-level, position-disabled, and identity?

Exactly the same as getViewerPose().

Does that help?

@Manishearth
Contributor Author

Manishearth commented Apr 2, 2019

Thank you, this helps a bunch!

getViewerPose(refSpace) is equivalent to getPose(source=viewerSpace, destination=refSpace)

Not quite, though, getViewerPose returns two transformations (for headsets), while getPose returns just one. The "mount the headset on the pose" statement is my attempt to clarify their relationship 😄

(It seems like your description matches my perception of getViewerPose(), i.e. it does the same thing as getPose except that it applies an additional eye transform afterwards, which is what I meant by "mount the headset on the pose")

Instead, what's happening is that the origin of the position-disabled reference space follows your head position (but not orientation) as you move around
..
No, the identity reference space when used for querying viewer poses is effectively a coordinate system glued to the bridge of your nose that moves along with your head, so that +X always points towards your right eye, assuming you haven't changed originOffset

Ah, this gets to the core of my confusion; this isn't at all clear from the spec or the spatial tracking explainer 😄

The word "identity" reference space strongly evokes an image of a reference space at rest at (0,0,0) (in some stationary reference space). This is further confusing for position-disabled because it's even subclassed as a stationary reference space, but it isn't stationary! It's only stationary with respect to the viewer, which IMO isn't a natural definition of "stationary". But that doesn't even matter because if we use that as the definition of a stationary reference space, then identity should be stationary and *-level should not be. We do have an actual definition of what "stationary" means but it's a bit awkward, more on this below.

Given the (to me) "natural" definition of what identity and position-disabled do, and given that these reference spaces are defined with how getViewerPose() interacts with them, I ended up thinking that they work differently with getPose() and getViewerPose()

Regarding the specced definition of "stationary", we currently have

An XRStationaryReferenceSpace represents a tracking space that the user is not expected to move around within
...
An XRReferenceSpace describes an XRSpace that is generally expected to remain static for the duration of the XRSession, with the most common exception being mid-session reconfiguration by the user.
...
An XRSpace describes an entity that is tracked by the XR device's tracking systems

This seems like an inconsistent set of definitions. eye-level doesn't match the definition of XRSpace, since it's not tracked, it's a ghost entity. identity and position-disabled don't match the definition of XRReferenceSpace, since they don't remain static but instead follow the headset.

And the definition of XRStationaryReferenceSpace seems to be more in terms of how it's supposed to be used, which feels somewhat awkward -- it doesn't tell me what it means.

This is also pretty confusing and inconsistent, and is part of what made it hard for me to understand what identity/position-disabled did.

This may belong in a separate issue, but it's all very closely related so I'll leave the comment here; lmk if I should split this out. It also raises some more pressing concerns: while the main topic of this thread is largely due to the spec being unclear, here we have the spec making false statements, which should definitely be fixed.

@Manishearth
Contributor Author

I guess we can do a bunch of things here:

  • Improve the definitions of the various space types and perhaps recategorize them.
  • Explicitly clarify that getViewerPose(ref) is equivalent to getPose(source=viewerSpace, dest=ref)
  • Rename identity to "viewer-tracking" or something, maybe? I suspect there may be good reasons I have missed behind the current name, though.
  • Explicitly clarify that XRSession.viewerSpace is identical to requesting an identity space
  • Explicitly clarify the poses of all these spaces as an independent concept, unrelated to their behaviors in getViewerPose()
  • Eventually explicitly list the matrix math so confusions like the originOffset issue are less likely

@klausw
Contributor

klausw commented Apr 2, 2019

Not quite, though, getViewerPose returns two transformations (for headsets), while getPose returns just one. The "mount the headset on the pose" statement is my attempt to clarify their relationship 😄

Ah, it wasn't clear to me that this is what you meant. It's a bit more complicated. getViewerPose returns an XRViewerPose. This is-a XRPose, and as such contains a transform member that's an XRRigidTransform (position + orientation), and this transform is exactly equivalent to the transform you'd get in the XRPose returned from getPose(viewerSpace, refSpace). This pose is for "viewer space", which is roughly placed at the midpoint of the user's eyes.

However, in addition to the XRPose transform, XRViewerPose also has a views array, and each view in that array has its own transform corresponding to a specific view. For a simple HMD each of those corresponds to an eye space with its corresponding offset and forward direction, but it could also be more complicated such as a display with angled screens or multiple screens per eye.
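
Put together in a render loop, consuming both levels looks roughly like this (a sketch; glLayer, gl, refSpace, and drawScene are assumed from session setup):

   const viewerPose = frame.getViewerPose(refSpace);
   if (viewerPose) {
      // viewerPose.transform is the viewer ("center eye") pose relative to
      // refSpace; each entry in viewerPose.views has its own transform.
      for (const view of viewerPose.views) {
         const vp = glLayer.getViewport(view);
         gl.viewport(vp.x, vp.y, vp.width, vp.height);
         // World-to-eye view matrix = inverse of this view's rigid transform.
         drawScene(view.projectionMatrix, view.transform.inverse.matrix);
      }
   }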

The word "identity" reference space strongly evokes an image of a reference space at rest at (0,0,0) (in some stationary reference space).

The name just means that querying viewer poses against it will always get you an identity transform. Maybe it would have been more consistent to call it position-and-orientation-disabled, but that would be clunky. It can't work the way you'd prefer since it's specifically intended for situations with no tracking, including inline mode where there's no hardware at all, so it doesn't know anything about stationary reference spaces.

This is further confusing for position-disabled because it's even subclassed as a stationary reference space, but it isn't stationary! It's only stationary with respect to the viewer, which IMO isn't a natural definition of "stationary". But that doesn't even matter because if we use that as the definition of a stationary reference space, then identity should be stationary and *-level should not be. We do have an actual definition of what "stationary" means but it's a bit awkward, more on this below.

Yes, the "stationary" spaces are intended for mostly-stationary users, though this isn't actually enforced (you can walk freely within the tracked area). And yes, the position-disabled and space is a special case.

This seems like an inconsistent set of definitions. eye-level doesn't match the definition of XRSpace, since it's not tracked, it's a ghost entity.

No, eye-level is a normal tracking space, it's just intended to give systems flexibility to do the right thing depending on hardware capabilities. On a 6DoF headset, it's very similar to floor-level, just with a different static origin, but floor-level and eye-level would typically be related by a simple transform that doesn't change when you move around. Only an explicit "reset seated orientation" or similar would change that transform between them. Imagine the coordinate axes for floor-level being stuck on the middle of the play area, and the eye-level coordinate axes could either be right above that, or be placed where the resting head position is in a gaming chair off to the side.

On a 3DoF headset, eye-level would have an origin that moves along with the base of your neck according to the inverse of the neck model, so the origin isn't static if you do translation movement such as leaning, but that is uninteresting since the application has no way of telling that it's not static. It's basically equivalent to the 6DoF version of eye-level as long as you keep your torso in place and just tilt your head.

identity and position-disabled don't match the definition of XRReferenceSpace, since they don't remain static but instead follow the headset.

You're right that this is inconsistent, and the spec shouldn't say that reference spaces are stationary without clarifying these exceptions.

Improve the definitions of the various space types and perhaps recategorize them.

I'm in favor of clarifying definitions, but I think they fit the use cases pretty well, so I think it would be reasonable to stick with the current names.

Explicitly clarify that getViewerPose(ref) is equivalent to getPose(source=viewerSpace, dest=ref)

The spec does say that already.

Rename identity to "viewer-tracking" or something, maybe? I suspect there may be good reasons I have missed behind the current name, though.

I think that would be even more confusing since it's what you use when you can't track the viewer...

Explicitly clarify that XRSession.viewerSpace is identical to requesting an identity space
Explicitly clarify the poses of all these spaces as an independent concept, unrelated to their behaviors in getViewerPose()
Eventually explicitly list the matrix math so confusions like the originOffset issue are less likely

I think my proposed changes should help with some of these; for example, it explicitly defines matrix entries and clarifies how pose positions relate to coordinate values. Please let me know what you think.

@Manishearth
Contributor Author

Heh, I hadn't realized #496 had happened, I was still looking at XRViewerPose as a non-subclass since I never updated Servo's viewerPose code to deal with #496.

The name just means that querying viewer poses against it will always get you an identity transform.

Overall I think this is one of the reasons I kept getting confused: a lot of space things are expressed in terms of how they behave in getViewerPose(), even though spaces seem to be intended as a general concept that gets used in a bunch of ways for a bunch of things.

Yes, the "stationary" spaces are intended for mostly-stationary users

Yeah, as I said before, naming the space after what you're supposed to use it for, as opposed to what it is, seems a bit weird.

One potential "fix" is to just flatten the types and get rid of XRStationaryReferenceSpace entirely, folding everything into XRReferenceSpace directly.

Another might be to clarify that the type argument to requestReferenceSpace() is asking for the type of immersive experience you're looking for. That may be tricky to get across in the API, but I don't think it matters too much; the actual type name doesn't matter to end users, and the API argument is sufficiently ambiguous that we can document it as asking for the type of immersive experience.

No, eye-level is a normal tracking space

I meant that it doesn't track any real entity, so it's a bit confusing. It's not wrong per se, a ghost entity positioned where your head was at t=0 is still an "entity" 😄

On a 3DoF headset, eye-level would have an origin that moves along with the base of your neck according to the inverse of the neck model, so the origin isn't static if you do translation movement such as leaning, but that is uninteresting since the application has no way of telling that it's not static. It's basically equivalent to the 6DoF version of eye-level as long as you keep your torso in place and just tilt your head.

Ah, the way I'm looking at these is a bit inverted -- I look at these spaces as behaving the same for 3DoF and 6DoF; however, in 3DoF the application gets no positional data, so it pretends the headset doesn't change position (ignoring neck modeling). I do think this might be a better way to spec this if we need to (eventually it would probably be useful to have notes in the spec explaining what happens for 3DoF devices), since it unifies the behavior of all spaces across devices, and you simply have to define how the viewer is tracked for various kinds of devices.

(I don't think this is a priority, but once we clarify all the other stuff I may take a crack at clarifying this.)

I think my proposed changes should help with some of these

It does, thank you for that! The individual spaces still need more definitions, but we can do that separately (I might try to write a PR for it once yours lands)

@klausw
Contributor

klausw commented Apr 2, 2019

I meant that it doesn't track any real entity, so [eye-space is] a bit confusing. It's not wrong per se, a ghost entity positioned where your head was at t=0 is still an "entity"

I think it's a bit more real than that. The spec says the origin is "near the user's head at the time of creation", but doesn't specifically require a t=0 snapshot. An approach as in SteamVR should be compliant also, where the seated origin is a calibrated spot chosen by the user to match their preferred seated location.

Ah, the way I'm looking at these is a bit inverted -- I look at these spaces as behaving the same for 3DOF and 6DOF, however in 3DOF the application gets no positional data so pretends the headset doesn't change position (ignoring neck modeling). I do think this might be a better way to spec this if we need (eventually it would probably be useful to have notes explaining what happens for 3DOF devices in the spec) since it unifies the behavior of all spaces across devices, and you simply have to define how the viewer is tracked for various kinds of devices.

I think we're in agreement here. One way of looking at it is that the restricted spaces discard some information and effectively ignore that part of the original headset pose. For example, position-disabled ignores the position component, treating all poses as equivalent if they only differ by position. Yes, the reference space isn't static in the real world, but the only varying bit is the position which is being ignored, so in a way it's still true that the components that it cares about are static.

If you want to get fancy, you could consider position-disabled to be the equivalence class consisting of the union of all possible origins in space for a given orientation, so you'd be sticking the oriented coordinate axes on every single point in space. This way, the space does actually remain static. Not sure if that view is particularly helpful though.

@Manishearth
Contributor Author

I think we're in agreement here

Oh, yeah, to be clear my view of this isn't incompatible with yours: either the definition of the spaces changes based on headset limitations, or what we look at as the coordinates of the "viewer" when defining spaces changes based on limitations.

I was suggesting to use the latter as it lets us consolidate all the device differences (inline, 3dof, 6dof) into a single definition of "the tracked viewer position", which can then be used by all the other XRSpaces. But that's a minor thing.

@klausw
Contributor

klausw commented Apr 2, 2019

FYI, I filed #579 "Clarify assumptions about stationary reference space movement" which is somewhat related to this.

@Manishearth
Contributor Author

A piece of this story is #565 , where we're discussing whether getPose(identity, NotViewerSpace) is even something that makes sense. In that case, identity can be defined purely by its behavior in getViewerPose() and as returning null elsewhere.

@toji added the fixed by pending PR label Apr 29, 2019