This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Change UI incentives to focus on interoperability, not test pass rate #83

Closed
jeffcarp opened this issue Aug 16, 2017 · 20 comments

@jeffcarp
Contributor

According to http://wpt.fyi/about, the stated purpose of the WPT Dashboard is:

to promote viewing the web platform as one entity and to make identifying and fixing interoperability issues as easy as possible.

However, the way the UI works today explicitly rewards passing tests over failing tests by displaying green for 100% passing results and shades of red for anything else.[1] If a browser came along and magically made all their tests 100% green, that wouldn't entirely satisfy the goal of platform predictability.

Ideally, as I understand the goals, the "opinion" of the dashboard UI should be:

  • Tests on all platforms passing: GOOD
  • Tests on all platforms failing: OK
  • Tests on two platforms passing, other two failing: BAD

My concrete suggestions are:

  1. Move away from using the colors green and red with test results. To maintain the ability to quickly glance and see passing vs. failing tests, we could map test passing percentage to a shade of blue on a linear scale.
  2. Calculate the standard deviation of test results per directory and per test file (normalized for number of total subtests) and use red & green colors to reward directories that have a low deviation.[2] We can also highlight rows more prominently that have a high deviation and therefore need more interop focus. (A rough sketch of both ideas follows below.)
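
A minimal sketch of how both suggestions could be computed. Names like `passRateToBlue`, `directorySigma`, and `BrowserResult` are illustrative only, not the dashboard's actual code, and the blue scale endpoints are arbitrary:

```ts
// Suggestion 1: map a pass fraction in [0, 1] to a shade of blue on a linear scale.
// Lightness runs from 90% (pale, few passes) down to 40% (saturated, all passing).
function passRateToBlue(passFraction: number): string {
  const lightness = 90 - Math.round(passFraction * 50);
  return `hsl(210, 80%, ${lightness}%)`;
}

// Suggestion 2: standard deviation of per-browser pass fractions for one
// directory or test file, normalized by each browser's subtest count.
// Low sigma = browsers agree (good interop); high sigma = divergence worth highlighting.
interface BrowserResult {
  passing: number; // subtests passing in this browser
  total: number;   // total subtests run in this browser
}

function directorySigma(results: BrowserResult[]): number {
  const fractions = results.map(r => (r.total ? r.passing / r.total : 0));
  const mean = fractions.reduce((a, b) => a + b, 0) / fractions.length;
  const variance =
    fractions.reduce((acc, f) => acc + (f - mean) ** 2, 0) / fractions.length;
  return Math.sqrt(variance);
}
```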

I have a demo of this up here: http://sigma-dot-wptdashboard.appspot.com/

[screenshot from 2017-08-16 13:46:29]

[1] The code that determines the color based on pass rate lives at components/wpt-results.html#L320
[2] The green=good, red=bad connotation applies only in Western cultures; however, I can't think of a better alternative.

@drufball

Amazing - LOVE this idea. Even just briefly scanning through the demo, the sigmas were highlighting good areas of focus.

@gsnedders
Member

One option is to show something derived from per-directory (presumably pre-computed) interop data, and not do anything based on pure pass/fail data.

@jgraham
Collaborator

jgraham commented Aug 18, 2017

So I love the approach here, but it feels like the implementation isn't perfect yet; some things that are easy to see in the red/green colour scheme are obscured, and not all of them are harmful. Maybe it's worth spending some time thinking about all the use cases. Some use cases I can think of:

  1. Find technologies that are widely implemented and so should be promoted as safe to use
  2. As a browser developer, find areas in which my implementation has interop issues that can be fixed
  3. As a test author, find tests that are suspiciously not passing in any browsers I believe implement the specification
  4. Find areas of the Web Platform that have interop concerns and use it to prioritise future engineering effort.

I feel like this presentation is pretty good for the last use case, but once you have made the decision to improve interop on a specific area it's less good for actually doing the work (case 2 above), because it's harder to tell which tests are actually failing. And it's hard for test authors to use to identify where tests don't pass on any implementation but should. In theory it seems like it could be good for use case 1, but green is used for all of "this has good test coverage and works well everywhere", "this has poor test coverage and so we don't know how well it works" and "this fails everywhere". I think some work is needed on the first column to disambiguate some of these cases. Also possibly to make it more than a 5-point score (all the values I saw are 0.0, 0.1, 0.2, 0.3 or 0.4; multiplying by 100 and rounding would feel like a more useful metric without changing the actual computation at all).

As a browser developer I would particularly like it to be easy to tell where there are test failures in my implementation that are passes in other implementations. Maybe that doesn't require different colours here, but it would require some way to filter down by result.

@foolip
Member

foolip commented Sep 1, 2017

I didn't see this until today, pretty exciting! Just seeing the σ without reading this issue I didn't know what to make of it, but clearly you're on to something here.

What is the most useful aggregated metric, and what incentives do we want to create? Given 4 engines, I think that:

  • 4/4 pass is the best, and means that web developers can depend on the behavior under test.
  • 2/4 is usually bad; remaining in that state for long can create pain for web developers.
  • There should be an incentive to move from 2/4 to 3/4 to 4/4.
  • Going from 2/4 to 1/4 or 0/4 might be the best path to interop, but if so the pass condition should be inverted, so that the pass rate is increasing.
  • 0/4 and 1/4 need investigation, but should in the long term mean a new change to the web platform and its first implementation.

The final point makes it tricky to define a metric that doesn't at some point decrease even though all the right things are happening. However, the metric has to decrease in order to be able to later reward the steps toward full interop.

So, my best idea is to use the total number of tests (see #98) as the denominator, and sum up scores based on the "goodness" of each test:

  • 4/4 ⇒ 100
  • 3/4 ⇒ 50
  • otherwise ⇒ 0

(Could be generalized to >4 implementers.)
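
A rough sketch of that scoring, with assumed names (not wpt.fyi code), might look like:

```ts
// Per-test "goodness" based on how many engines pass it.
function testGoodness(passingEngines: number, totalEngines: number): number {
  if (passingEngines === totalEngines) return 100;     // e.g. 4/4
  if (passingEngines === totalEngines - 1) return 50;  // e.g. 3/4
  return 0;                                            // 2/4, 1/4, 0/4
}

// Aggregate: sum of per-test scores over the total number of tests (the denominator).
function aggregateScore(perTestPassCounts: number[], totalEngines = 4): number {
  const sum = perTestPassCounts
    .map(n => testGoodness(n, totalEngines))
    .reduce((a, b) => a + b, 0);
  return sum / perTestPassCounts.length;
}
```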

Then, implementers who want to improve the aggregate score should focus on the cases where they are the last failing implementation, or where they can move it from 2/4 or 3/4. Other than a disincentive to increasing the denominator, what else would be wrong with this?

@jgraham
Collaborator

jgraham commented Sep 1, 2017

So I like the idea of increasing the weight attached to fixing tests that pass in multiple other implementations. I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

A problem we have here is that I don't think we know what interoperability looks like. As a thought experiment, let's say that instead of developing a metric on theoretical considerations, we decided to train an ML model to produce such a metric based on past data. In that case I don't think we would know how to create a meaningful training set, which implies we don't really know what "interoperability" looks like in this data yet. Therefore I'm wary of attaching too much weight to any specific metric.

@foolip
Member

foolip commented Sep 4, 2017

I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

A problem we have here is that I don't think we know what interoperability looks like.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.
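
Purely as an illustration of that weighting (the per-test use-counter data it assumes does not exist today; all names here are hypothetical):

```ts
interface TestCoverage {
  testId: string;
  counterIds: string[]; // spec-level use counters the test exercises
}

// pageHitFraction: fraction of real-world page loads hitting each counter.
// A test's weight is the average real-world usage of the counters it covers.
function testWeight(test: TestCoverage, pageHitFraction: Map<string, number>): number {
  if (test.counterIds.length === 0) return 0;
  const total = test.counterIds
    .map(id => pageHitFraction.get(id) ?? 0)
    .reduce((a, b) => a + b, 0);
  return total / test.counterIds.length;
}
```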

@drufball and @RByers have had ideas about experiments along these lines, and I think we should seriously consider it, but I think having a simpler base metric would still be useful.

@jgraham
Collaborator

jgraham commented Sep 4, 2017

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

I was literally thinking a boolean yes/no, because like you I don't know how to get data on coverage. Mozilla can possibly provide code coverage data for gecko, but at the moment it's for all of wpt (although I could probably generate per-directory data on demand), and it requires an expert to interpret it, so I don't know how helpful it is.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.

I'm not sure I entirely follow, but telemetry at that level seems like a lot of effort (is anyone really going to go through HTML line by line and turn every assertion into a telemetry probe?) and probably privacy-sensitive, since it might be possible to reconstruct browsing history from such detailed telemetry.

@foolip
Member

foolip commented Sep 5, 2017

I was literally thinking a boolean yes/no

Seems simpler, how would it feed into the aggregate score, if at all?

telemetry at that level seems like a lot of effort

Yes, I don't think line-by-line telemetry is doable, I was just making the argument that we have some conceptual idea about what interoperability looks like and how to measure it. The challenge isn't so much discovering what it is, but coming up with useful approximations that can be measured.

Going back to this issue, what are the options for an aggregate metric that are worth pursuing?

@jgraham
Collaborator

jgraham commented Sep 5, 2017

Seems simpler, how would it feed into the aggregate score, if at all?

I don't have a good feeling for how the details should work out; I think we would need to look at various examples with different possible approaches to see what metric ended up matching our intuition. But I would expect that a complete testsuite would be a requirement to categorise something as having good interoperability, and would increase the impact metric for bugs (i.e. browser developers would be encouraged to preferentially work on features with good interop in other implementations and a "complete" testsuite). This could perhaps just be applied as a multiplier on some underlying metric e.g. increase all the scores by a factor of 2 when the testsuite is judged complete, and set some thresholds so that a spec with an incomplete testsuite could never be marked as having good interop.
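
One way that multiplier-plus-threshold idea could be sketched (the factor of 2 comes from the comment above; the cap is an arbitrary placeholder, and none of these names exist in the dashboard):

```ts
interface DirectoryMetric {
  interopScore: number;       // e.g. the 0-100 aggregate discussed earlier
  testsuiteComplete: boolean; // human-provided yes/no judgement
}

function adjustedScore(m: DirectoryMetric): number {
  // Boost directories whose testsuite is judged complete...
  const boosted = m.testsuiteComplete ? m.interopScore * 2 : m.interopScore;
  // ...and cap incomplete testsuites so they can never reach the
  // "good interop" band (60 is an arbitrary illustrative threshold).
  const cap = m.testsuiteComplete ? Number.POSITIVE_INFINITY : 60;
  return Math.min(boosted, cap);
}
```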

Of course it's not entirely clear how this works with living standards where the testsuite could get worse over time, although living standard + commitment to add tests with every spec change might be good enough.

@foolip
Member

foolip commented Sep 16, 2017

@bobholt, as the one who will likely implement and/or maintain this, do you have any thoughts about the metric itself, or implementation details? (It seems to me that #98 might impact this a little bit.)

@bobholt
Contributor

bobholt commented Sep 21, 2017

I agree it's a hard problem. I sort of like the idea that 0/4 is okay and 4/4 is okay, but 2/4 is bad - developing a metric to show deviation from cross-platform consistency. Except that it may incentivize an early-adopter vendor to drop support for a relatively new and highly-desired-by-developers feature rather than wait for interop.

I agree that we think we kind of know what interop looks like, but we don't really know at a data level. Compounding this is that we can't be entirely sure at this point whether a failing test is due to a failing implementation, a bug in the test, a bug in the test runner, or a bug in the way the dashboard invokes the runner without going test-by-test to figure it out. That's why the work @rwaldron and @boazsender did on https://bocoup.github.io/wpt-error-report/ is valuable - it is exposing areas of the dashboard tests that broadly fail in the same way and are good candidates for further investigation.

But getting back to it, I think we need to agree on what interop looks like away from the data (all browsers implementing? all browsers not implementing? with or without feature flags? how do we measure interop of new features when we know they'll be incompletely implemented for a period? do we set a time limit on that period?)

@foolip
Member

foolip commented Oct 1, 2017

I'm preparing a presentation for https://webengineshackfest.org/ and as part of that I fiddled with devtools a bit to make a mockup of what a simple 4/3/2/1 browser-neutral view might look like:
[screen shot 2017-10-01 at 1:44:48 pm]

Colors need a lot of tweaking of course, and we might want a 0/4 column, but I think the above wouldn't be too bad.

@foolip
Member

foolip commented Oct 1, 2017

Maybe percentages would make this nicer still, but they'd mean very different things depending on the completeness of the test suites.

@foolip
Member

foolip commented Nov 8, 2017

I made another mockup for another presentation:
[screen shot 2017-11-08 at 1:41:59 am]

@mdittmer
Collaborator

Demo of proposed pass rate metrics is temporarily available at https://metrics5-dot-wptdashboard.appspot.com/metrics/

Feedback welcome! @foolip has already mentioned that maybe the order should be 4 / 4 down to 0 / 4. I would also like to add links to the equivalent results-based (rather than metrics-based) view somewhere. ATM, search in this view works a bit differently than in the results-based view. We should discuss what approach makes the most sense here. (Perhaps create a separate issue for that?)

@lukebjerring
Collaborator

lukebjerring commented Dec 13, 2017 via email

@mdittmer
Collaborator

@lukebjerring I'm having trouble parsing some aspects of your recommendations, but we can chat offline.

  - e.g. (Chrome is failing 4 of the 7 tests which pass in 3/4 browsers)

I believe that any browser-specific information was an explicit non-goal for this view. The idea is to assess general interop health independent of "who is passing, who is failing". Another view is coming soon that shows per-browser failing tests, ordered (ascending) by number of other browsers failing (i.e., start with tests where "this is the only browser failing this test").
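
For what it's worth, the ordering described there could look roughly like this (hypothetical types, not the actual view code):

```ts
interface TestStatus {
  testId: string;
  failingBrowsers: Set<string>;
}

// Tests failing in `browser`, sorted ascending by how many *other* browsers
// also fail them, so "only this browser fails" comes first.
function failuresForBrowser(tests: TestStatus[], browser: string): TestStatus[] {
  return tests
    .filter(t => t.failingBrowsers.has(browser))
    .sort((a, b) => a.failingBrowsers.size - b.failingBrowsers.size);
}
```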

@mdittmer
Collaborator

Just met with @foolip to discuss these comments and other thoughts.

The following changes will be applied to mdittmer#3 (or earlier PR, in the case of back end changes) to improve this UI:

  • Column order: 4 / 4 down to 0 / 4
  • Sort files & folders lexicographically
  • No wrap in <th>s
  • Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table
  • Navigating to a directory will not reset the query string
  • Routes for results and metrics will be swapped:
    • Results will now live at /results/
    • Metrics will now live at /
    • Links at top will be updated accordingly

Still to sort out for mdittmer#3:

  • Search strategies for results and metrics pages are different; unify them as much as possible without tanking performance

Future work on metrics (and results) web components:

  • Unified controller:
    • Sort and filter
    • (re)render table rows
    • Any "display mode" configuration (e.g., support displaying and/or sorting by percentage vs. number of passing tests)

@foolip
Member

foolip commented Dec 14, 2017

No wrap in <th>s

Or always wrap, whichever you think looks better.

Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table

Yep, and this probably needs to be a bit prominent.

@lukebjerring
Collaborator

This issue was moved to web-platform-tests/wpt.fyi#39
