This repository has been archived by the owner on Nov 6, 2019. It is now read-only.

Change UI incentives to focus on interoperability, not test pass rate #83

Closed
jeffcarp opened this issue Aug 16, 2017 · 20 comments

@jeffcarp
Contributor

According to http://wpt.fyi/about, the stated purpose of the WPT Dashboard is:

to promote viewing the web platform as one entity and to make identifying and fixing interoperability issues as easy as possible.

However, the way the UI works today explicitly rewards passing tests over failing tests by displaying green for 100% passing results and shades of red for anything else.[1] If a browser came along and magically made all their tests 100% green, that wouldn't entirely satisfy the goal of platform predictability.

Ideally, as I understand the goals, the "opinion" of the dashboard UI should be:

  • Tests on all platforms passing: GOOD
  • Tests on all platforms failing: OK
  • Tests on two platforms passing, other two failing: BAD

My concrete suggestions are:

  1. Move away from using the colors green and red with test results. To maintain the ability to quickly glance and see passing vs. failing tests, we could map test passing percentage to a shade of blue on a linear scale.
  2. Calculate the standard deviation of test results per directory and per test file (normalized for number of total subtests) and use red & green colors to reward directories that have a low deviation.[2] We can also highlight rows more prominently that have a high deviation and therefore need more interop focus. (A rough sketch of both ideas follows below.)
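
A minimal sketch of how both suggestions could be computed. Names like `passRateToBlue`, `directorySigma`, and `BrowserResult` are illustrative only, not the dashboard's actual code, and the blue scale endpoints are arbitrary:

```ts
// Suggestion 1: map a pass fraction in [0, 1] to a shade of blue on a linear scale.
// Lightness runs from 90% (pale, few passes) down to 40% (saturated, all passing).
function passRateToBlue(passFraction: number): string {
  const lightness = 90 - Math.round(passFraction * 50);
  return `hsl(210, 80%, ${lightness}%)`;
}

// Suggestion 2: standard deviation of per-browser pass fractions for one
// directory or test file, normalized by each browser's subtest count.
// Low sigma = browsers agree (good interop); high sigma = divergence worth highlighting.
interface BrowserResult {
  passing: number; // subtests passing in this browser
  total: number;   // total subtests run in this browser
}

function directorySigma(results: BrowserResult[]): number {
  const fractions = results.map(r => (r.total ? r.passing / r.total : 0));
  const mean = fractions.reduce((a, b) => a + b, 0) / fractions.length;
  const variance =
    fractions.reduce((acc, f) => acc + (f - mean) ** 2, 0) / fractions.length;
  return Math.sqrt(variance);
}
```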

I have a demo of this up here: http://sigma-dot-wptdashboard.appspot.com/

[screenshot from 2017-08-16 13:46:29]

[1] The code that determines the color based on pass rate lives at components/wpt-results.html#L320
[2] The green=good, red=bad connotation applies only in Western cultures; however, I can't think of a better alternative.

@drufball

Amazing - LOVE this idea. Even just briefly scanning through the demo, the sigmas were highlighting good areas of focus.

@gsnedders
Member

One option is to show something derived from per-directory (presumably pre-computed) interop data, and not do anything based on pure pass/fail data.

@jgraham
Collaborator

jgraham commented Aug 18, 2017

So I love the approach here, but it feels like the implementation isn't perfect yet; some things that are easy to see in the red/green colour scheme are obscured, and not all of them are harmful. Maybe it's worth spending some time thinking about all the use cases. Some use cases I can think of:

  1. Find technologies that are widely implemented and so should be promoted as safe to use
  2. As a browser developer, find areas in which my implementation has interop issues that can be fixed
  3. As a test author, find tests that are suspiciously not passing in any browsers I believe implement the specification
  4. Find areas of the Web Platform that have interop concerns and use it to prioritise future engineering effort.

I feel like this presentation is pretty good for the last use case, but once you have made the decision to improve interop on a specific area it's less good for actually doing the work (case 2 above), because it's harder to tell which tests are actually failing. And it's hard for test authors to use to identify where tests don't pass on any implementation but should. In theory it seems like it could be good for use case 1, but green is used for all of "this has good test coverage and works well everywhere", "this has poor test coverage and so we don't know how well it works" and "this fails everywhere". I think some work is needed on the first column to disambiguate some of these cases. Also possibly to make it more than a 5-point score (all the values I saw are 0.0, 0.1, 0.2, 0.3 or 0.4; multiplying by 100 and rounding would feel like a more useful metric without changing the actual computation at all).

As a browser developer I would particularly like it to be easy to tell where there are test failures in my implementation that are passes in other implementations. Maybe that doesn't require different colours here, but it would require some way to filter down by result.

@foolip
Member

foolip commented Sep 1, 2017

I didn't see this until today, pretty exciting! Just seeing the σ without reading this issue I didn't know what to make of it, but clearly you're on to something here.

What is the most useful aggregated metric, and what incentives do we want to create? Given 4 engines, I think that:

  • 4/4 pass is the best, and means that web developers can depend on the behavior under test.
  • 2/4 is usually bad; remaining in that state for long can create pain for web developers.
  • There should be an incentive to move from 2/4 to 3/4 to 4/4.
  • Going from 2/4 to 1/4 or 0/4 might be the best path to interop, but if so the pass condition should be inverted, so that the pass rate is increasing.
  • 0/4 and 1/4 need investigation, but should in the long term mean a new change to the web platform and its first implementation.

The final point makes it tricky to define a metric that doesn't at some point decrease even though all the right things are happening. However, the metric has to decrease in order to be able to later reward the steps toward full interop.

So, my best idea is to use the total number of tests (see #98) as the denominator, and sum up scores based on the "goodness" of each test:

  • 4/4 ⇒ 100
  • 3/4 ⇒ 50
  • otherwise ⇒ 0

(Could be generalized to >4 implementers.)
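
A rough sketch of that scoring, with assumed names (not wpt.fyi code), might look like:

```ts
// Per-test "goodness" based on how many engines pass it.
function testGoodness(passingEngines: number, totalEngines: number): number {
  if (passingEngines === totalEngines) return 100;     // e.g. 4/4
  if (passingEngines === totalEngines - 1) return 50;  // e.g. 3/4
  return 0;                                            // 2/4, 1/4, 0/4
}

// Aggregate: sum of per-test scores over the total number of tests (the denominator).
function aggregateScore(perTestPassCounts: number[], totalEngines = 4): number {
  const sum = perTestPassCounts
    .map(n => testGoodness(n, totalEngines))
    .reduce((a, b) => a + b, 0);
  return sum / perTestPassCounts.length;
}
```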

Then, implementers who want to improve the aggregate score should focus on the cases where they are the last failing implementation, or where they can move it from 2/4 or 3/4. Other than a disincentive to increasing the denominator, what else would be wrong with this?

@jgraham
Collaborator

jgraham commented Sep 1, 2017

So I like the idea of increasing the weight attached to fixing tests that pass in multiple other implementations. I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

A problem we have here is that I don't think we know what interoperability looks like. As a thought experiment, let's say that instead of developing a metric on theoretical considerations, we decided to train an ML model to produce such a metric based on past data. In that case I don't think we would know how to create a meaningful training set, which implies we don't really know what "interoperability" looks like in this data yet. Therefore I'm wary of attaching too much weight to any specific metric.

@foolip
Member

foolip commented Sep 4, 2017

I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

A problem we have here is that I don't think we know what interoperability looks like.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.
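
Purely as an illustration of that weighting (the per-test use-counter data it assumes does not exist today; all names here are hypothetical):

```ts
interface TestCoverage {
  testId: string;
  counterIds: string[]; // spec-level use counters the test exercises
}

// pageHitFraction: fraction of real-world page loads hitting each counter.
// A test's weight is the average real-world usage of the counters it covers.
function testWeight(test: TestCoverage, pageHitFraction: Map<string, number>): number {
  if (test.counterIds.length === 0) return 0;
  const total = test.counterIds
    .map(id => pageHitFraction.get(id) ?? 0)
    .reduce((a, b) => a + b, 0);
  return total / test.counterIds.length;
}
```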

@drufball and @RByers have had ideas about experiments along these lines, and I think we should seriously consider it, but I think having a simpler base metric would still be useful.

@jgraham
Collaborator

jgraham commented Sep 4, 2017

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

I was literally thinking a boolean yes/no, because like you I don't know how to get data on coverage. Mozilla can possibly provide code coverage data for gecko, but at the moment it's for all of wpt (although I could probably generate per-directory data on demand), and it requires an expert to interpret it, so I don't know how helpful it is.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.

I'm not sure I entirely follow, but telemetry at that level seems like a lot of effort (is anyone really going to go through HTML line by line and turn every assertion into a telemetry probe?) and probably privacy-sensitive, since it might be possible to reconstruct browsing history from such detailed telemetry.

@foolip
Member

foolip commented Sep 5, 2017

I was literally thinking a boolean yes/no

Seems simpler, how would it feed into the aggregate score, if at all?

telemetry at that level seems like a lot of effort

Yes, I don't think line-by-line telemetry is doable, I was just making the argument that we have some conceptual idea about what interoperability looks like and how to measure it. The challenge isn't so much discovering what it is, but coming up with useful approximations that can be measured.

Going back to this issue, what are the options for an aggregate metric that are worth pursuing?

@jgraham
Collaborator

jgraham commented Sep 5, 2017

Seems simpler, how would it feed into the aggregate score, if at all?

I don't have a good feeling for how the details should work out; I think we would need to look at various examples with different possible approaches to see what metric ended up matching our intuition. But I would expect that a complete testsuite would be a requirement to categorise something as having good interoperability, and would increase the impact metric for bugs (i.e. browser developers would be encouraged to preferentially work on features with good interop in other implementations and a "complete" testsuite). This could perhaps just be applied as a multiplier on some underlying metric e.g. increase all the scores by a factor of 2 when the testsuite is judged complete, and set some thresholds so that a spec with an incomplete testsuite could never be marked as having good interop.
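
One way that multiplier-plus-threshold idea could be sketched (the factor of 2 comes from the comment above; the cap is an arbitrary placeholder, and none of these names exist in the dashboard):

```ts
interface DirectoryMetric {
  interopScore: number;       // e.g. the 0-100 aggregate discussed earlier
  testsuiteComplete: boolean; // human-provided yes/no judgement
}

function adjustedScore(m: DirectoryMetric): number {
  // Boost directories whose testsuite is judged complete...
  const boosted = m.testsuiteComplete ? m.interopScore * 2 : m.interopScore;
  // ...and cap incomplete testsuites so they can never reach the
  // "good interop" band (60 is an arbitrary illustrative threshold).
  const cap = m.testsuiteComplete ? Number.POSITIVE_INFINITY : 60;
  return Math.min(boosted, cap);
}
```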

Of course it's not entirely clear how this works with living standards where the testsuite could get worse over time, although living standard + commitment to add tests with every spec change might be good enough.

@foolip
Member

foolip commented Sep 16, 2017

@bobholt, as the one who will likely implement and/or maintain this, do you have any thoughts about the metric itself, or implementation details? (It seems to me that #98 might impact this a little bit.)

@bobholt
Contributor

bobholt commented Sep 21, 2017

I agree it's a hard problem. I sort of like the idea that 0/4 is okay and 4/4 is okay, but 2/4 is bad - developing a metric to show deviation from cross-platform consistency. Except that it may incentivize an early-adopter vendor to drop support for a relatively new and highly-desired-by-developers feature rather than wait for interop.

I agree that we think we kind of know what interop looks like, but we don't really know at a data level. Compounding this is that we can't be entirely sure at this point whether a failing test is due to a failing implementation, a bug in the test, a bug in the test runner, or a bug in the way the dashboard invokes the runner without going test-by-test to figure it out. That's why the work @rwaldron and @boazsender did on https://bocoup.github.io/wpt-error-report/ is valuable - it is exposing areas of the dashboard tests that broadly fail in the same way and are good candidates for further investigation.

But getting back to it, I think we need to agree on what interop looks like away from the data (all browsers implementing? all browsers not implementing? with or without feature flags? how do we measure interop of new features when we know they'll be incompletely implemented for a period? do we set a time limit on that period?)

@foolip
Member

foolip commented Oct 1, 2017

I'm preparing a presentation for https://webengineshackfest.org/ and as part of that I fiddled with devtools a bit to make a mockup of what a simple 4/3/2/1 browser-neutral view might look like:
[screen shot 2017-10-01 at 1:44:48 pm]

Colors need a lot of tweaking of course, and we might want a 0/4 column, but I think the above wouldn't be too bad.

@foolip
Member

foolip commented Oct 1, 2017

Maybe percentages would make this nicer still, but they'd mean very different things depending on the completeness of the test suites.

@foolip
Member

foolip commented Nov 8, 2017

I made another mockup for another presentation:
[screen shot 2017-11-08 at 1:41:59 am]

@mdittmer
Collaborator

Demo of proposed pass rate metrics is temporarily available at https://metrics5-dot-wptdashboard.appspot.com/metrics/

Feedback welcome! @foolip has already mentioned that maybe the order should be 4 / 4 down to 0 / 4. I would also like to add links to the equivalent results-based (rather than metrics-based) view somewhere. ATM, search in this view works a bit differently than in the results-based view. We should discuss what approach makes the most sense here. (Perhaps create a separate issue for that?)

@lukebjerring
Collaborator

lukebjerring commented Dec 13, 2017 via email

@mdittmer
Collaborator

@lukebjerring I'm having trouble parsing some aspects of your recommendations, but we can chat offline.

  - e.g. (Chrome is failing 4 of the 7 tests which pass in 3/4 browsers)

I believe that any browser-specific information was an explicit non-goal for this view. The idea is to assess general interop health independent of "who is passing, who is failing". Another view is coming soon that shows per-browser failing tests, ordered (ascending) by number of other browsers failing (i.e., start with tests where "this is the only browser failing this test").
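
For what it's worth, the ordering described there could look roughly like this (hypothetical types, not the actual view code):

```ts
interface TestStatus {
  testId: string;
  failingBrowsers: Set<string>;
}

// Tests failing in `browser`, sorted ascending by how many *other* browsers
// also fail them, so "only this browser fails" comes first.
function failuresForBrowser(tests: TestStatus[], browser: string): TestStatus[] {
  return tests
    .filter(t => t.failingBrowsers.has(browser))
    .sort((a, b) => a.failingBrowsers.size - b.failingBrowsers.size);
}
```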

@mdittmer
Collaborator

Just met with @foolip to discuss these comments and other thoughts.

The following changes will be applied to mdittmer#3 (or earlier PR, in the case of back end changes) to improve this UI:

  • Column order: 4 / 4 down to 0 / 4
  • Sort files & folders lexicographically
  • No wrap in <th>s
  • Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table
  • Navigating to a directory will not reset the query string
  • Routes for results and metrics will be swapped:
    • Results will now live at /results/
    • Metrics will now live at /
    • Links at top will be updated accordingly

Still to sort out for mdittmer#3:

  • Search strategies for results and metrics pages are different; unify them as much as possible without tanking performance

Future work on metrics (and results) web components:

  • Unified controller:
    • Sort and filter
    • (re)render table rows
    • Any "display mode" configuration (e.g., support displaying and/or sorting by percentage vs. number of passing tests)

@foolip
Member

foolip commented Dec 14, 2017

No wrap in <th>s

Or always wrap, whichever you think looks better.

Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table

Yep, and this probably needs to be a bit prominent.

@lukebjerring
Collaborator

This issue was moved to web-platform-tests/wpt.fyi#39
