
Change UI incentives to focus on interoperability, not test pass rate #39

Closed
lukebjerring opened this issue Apr 10, 2018 · 21 comments

Comments

@lukebjerring
Contributor

From @jeffcarp on August 16, 2017 20:56

According to http://wpt.fyi/about, the stated purpose of the WPT Dashboard is:

to promote viewing the web platform as one entity and to make identifying and fixing interoperability issues as easy as possible.

However, the way the UI works today explicitly rewards passing tests over failing tests by displaying green for 100% passing results and shades of red for anything else.[1] If a browser came along and magically made all their tests 100% green, that wouldn't entirely satisfy the goal of platform predictability.

Ideally, as I understand the goals, the "opinion" of the dashboard UI should be:

  • Tests on all platforms passing: GOOD
  • Tests on all platforms failing: OK
  • Tests on two platforms passing, other two failing: BAD

My concrete suggestions are:

  1. Move away from using the colors green and red with test results. To maintain the ability to quickly glance and see passing vs. failing tests, we could map test passing percentage to a shade of blue on a linear scale.
  2. Calculate the standard deviation of test results per directory and per test file (normalized for number of total subtests) and use red & green colors to reward directories that have a low deviation.[2] We can also highlight rows more prominently that have a high deviation and therefore need more interop focus.
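For concreteness, a minimal sketch of both suggestions, assuming per-directory results are summarised as one pass fraction per browser (the data shape, function names, and HSL mapping are illustrative, not the demo's actual code):

```python
from statistics import pstdev

def directory_deviation(per_browser_pass_rates):
    """per_browser_pass_rates: pass fraction per browser for one directory,
    e.g. [0.95, 0.93, 0.41, 0.90]. Lower deviation means better interop."""
    return pstdev(per_browser_pass_rates)

def pass_rate_to_blue(pass_fraction):
    """Map a 0..1 pass fraction onto a shade of blue on a linear scale."""
    lightness = 90 - round(pass_fraction * 50)  # 90% (pale) down to 40% (saturated)
    return f"hsl(210, 80%, {lightness}%)"

print(directory_deviation([0.95, 0.93, 0.41, 0.90]))  # high deviation: needs interop focus
print(pass_rate_to_blue(0.75))                        # "hsl(210, 80%, 52%)"
```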

I have a demo of this up here: http://sigma-dot-wptdashboard.appspot.com/

[Screenshot of the demo, 2017-08-16]

[1] The code that determines the color based on pass rate lives at components/wpt-results.html#L320
[2] The green=good, red=bad connotation applies only in Western cultures; however, I can't think of a better alternative.

Copied from original issue: web-platform-tests/results-collection#83

@lukebjerring
Contributor Author

From @drufball on August 17, 2017 17:57

Amazing - LOVE this idea. Even just briefly scanning through the demo, the sigmas were highlighting good areas of focus.

@lukebjerring
Contributor Author

From @gsnedders on August 18, 2017 15:5

One option is to show something derived from per-directory (presumably pre-computed) interop data, and not do anything based on pure pass/fail data.

@lukebjerring
Contributor Author

From @jgraham on August 18, 2017 17:9

So I love the approach here, but it feels like the implementation isn't perfect yet; some things that are easy to see in the red/green colour scheme are obscured, and not all of them are harmful. Maybe it's worth spending some time thinking about all the use cases. Some use cases I can think of:

  1. Find technologies that are widely implemented and so should be promoted as safe to use
  2. As a browser developer, find areas in which my implementation has interop issues that can be fixed
  3. As a test author, find tests that are suspiciously not passing in any browsers I believe implement the specification
  4. Find areas of the Web Platform that have interop concerns and use it to prioritise future engineering effort.

I feel like this presentation is pretty good for the last use case, but once you have made the decision to improve interop on a specific area it's less good for actually doing the work (case 2 above), because it's harder to tell which tests are actually failing. And it's hard for test authors to use to identify where tests don't pass on any implementation but should. In theory it seems like it could be good for use case 1, but green is used for all of "this has good test coverage and works well everywhere", "this has poor test coverage and so we don't know how well it works" and "this fails everywhere". I think some work is needed on the first column to disambiguate some of these cases. Also possibly to make it more than a 5-point score (all the values I saw are 0.0, 0.1, 0.2, 0.3 or 0.4; multiplying by 100 and rounding would feel like a more useful metric without changing the actual computation at all).

As a browser developer I would particularly like it to be easy to tell where there are test failures in my implementation that are passes in other implementations. Maybe that doesn't require different colours here, but it would require some way to filter down by result.

@lukebjerring
Contributor Author

From @foolip on September 1, 2017 13:28

I didn't see this until today, pretty exciting! Just seeing the σ without reading this issue I didn't know what to make of it, but clearly you're on to something here.

What is the most useful aggregated metric, and what incentives do we want to create? Given 4 engines, I think that:

  • 4/4 pass is the best, and means that web developers can depend on the behavior under test.
  • 2/4 is usually bad; remaining in that state for long can create pain for web developers.
  • There should be an incentive to move from 2/4 to 3/4 to 4/4.
  • Going from 2/4 to 1/4 or 0/4 might be the best path to interop, but if so the pass condition should be inverted, so that the pass rate is increasing.
  • 0/4 and 1/4 need investigation, but should in the long term mean a new change to the web platform and its first implementation.

The final point makes it tricky to define a metric that doesn't at some point decrease even though all the right things are happening. However, the metric has to decrease in order to be able to later reward the steps toward full interop.

So, my best idea is to use the total number of tests (see web-platform-tests/results-collection#98) as the denominator, and sum up scores based on the "goodness" of each test:

  • 4/4 ⇒ 100
  • 3/4 ⇒ 50
  • otherwise ⇒ 0

(Could be generalized to >4 implementers.)

Then, implementers who want to improve the aggregate score should focus on the cases where they are the last failing implementation, or where they can move it from 2/4 or 3/4. Other than a disincentive to increasing the denominator, what else would be wrong with this?
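For illustration, a minimal sketch of that scoring, assuming each test is reduced to the number of engines passing it (names and data shapes are hypothetical, not the dashboard's implementation):

```python
def goodness(passing, engines=4):
    """Score one test by how close it is to passing in all engines."""
    if passing == engines:
        return 100
    if passing == engines - 1:
        return 50
    return 0

def aggregate_score(per_test_passes, total_tests):
    # total_tests is the full denominator (see results-collection#98),
    # so tests with no recorded result simply contribute 0.
    return sum(goodness(p) for p in per_test_passes) / total_tests

print(aggregate_score([4, 4, 3, 2, 0], total_tests=5))  # (100 + 100 + 50 + 0 + 0) / 5 = 50.0
```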

@lukebjerring
Contributor Author

From @jgraham on September 1, 2017 14:29

So I like the idea of increasing the weight attached to fixing tests that pass in multiple other implementations. I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

A problem we have here is that I don't think we know what interoperability looks like. As a thought experiment, let's say that instead of developing a metric on theoretical considerations, we decided to train an ML model to produce such a metric based on past data. In that case I don't think we would know how to create a meaningful training set, which implies we don't really know what "interoperability" looks like in this data yet. Therefore I'm wary of attaching too much weight to any specific metric.

@lukebjerring
Contributor Author

From @foolip on September 4, 2017 8:27

I also think that we should consider supplementing autogenerated data with human-derived data about the perceived completeness of different testsuites.

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

A problem we have here is that I don't think we know what interoperability looks like.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.

@drufball and @RByers have had ideas about experiments along these lines, and I think we should seriously consider it, but I think having a simpler base metric would still be useful.

@lukebjerring
Contributor Author

From @jgraham on September 4, 2017 10:8

Do you mean something like a simple percentage, by which the aggregate is scaled, so that a test suite judged to only be 30% complete can at best score 30%? That WFM, but how would we seed the data?

I was literally thinking a boolean yes/no, because like you I don't know how to get data on coverage. Mozilla can possibly provide code coverage data for gecko, but at the moment it's for all of wpt (although I could probably generate per-directory on demand), and it requires an expert to interpret it, so I don't know how helpful it is.

I think we have some idea, but no way of measuring it directly at this point. I would argue that if we had a "use counter" for every line of every spec, translated into metrics in all implementations, then each test should be weighted by how often the use counters it hits are also hit in the wild, and the test suite's coverage could also be straightforwardly measured.

I'm not sure I entirely follow, but telemetry at that level seems like a lot of effort (is anyone really going to go through HTML line by line and turn every assertion into a telemetry probe?) and probably privacy-sensitive, since it might be possible to reconstruct browsing history from such detailed telemetry.

@lukebjerring
Contributor Author

From @foolip on September 5, 2017 12:25

I was literally thinking a boolean yes/no

Seems simpler, how would it feed into the aggregate score, if at all?

telemetry at that level seems like a lot of effort

Yes, I don't think line-by-line telemetry is doable, I was just making the argument that we have some conceptual idea about what interoperability looks like and how to measure it. The challenge isn't so much discovering what it is, but coming up with useful approximations that can be measured.

Going back to this issue, what are the options for an aggregate metric that are worth pursuing?

@lukebjerring
Contributor Author

From @jgraham on September 5, 2017 12:41

Seems simpler, how would it feed into the aggregate score, if at all?

I don't have a good feeling for how the details should work out; I think we would need to look at various examples with different possible approaches to see what metric ended up matching our intuition. But I would expect that a complete testsuite would be a requirement to categorise something as having good interoperability, and would increase the impact metric for bugs (i.e. browser developers would be encouraged to preferentially work on features with good interop in other implementations and a "complete" testsuite). This could perhaps just be applied as a multiplier on some underlying metric e.g. increase all the scores by a factor of 2 when the testsuite is judged complete, and set some thresholds so that a spec with an incomplete testsuite could never be marked as having good interop.
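One way to read the multiplier-plus-threshold idea, sketched with hypothetical names and a hypothetical cut-off (a boolean completeness judgement doubling the score and gating the "good interop" label):

```python
GOOD_INTEROP_THRESHOLD = 75  # assumed cut-off on a 0-100 aggregate score

def adjusted_score(base_score, suite_complete):
    # Complete test suites get double weight, so their interop gaps rank higher.
    score = base_score * 2 if suite_complete else base_score
    # An incomplete suite can never be labelled as having good interop.
    good_interop = suite_complete and base_score >= GOOD_INTEROP_THRESHOLD
    return score, good_interop

print(adjusted_score(80, suite_complete=True))   # (160, True)
print(adjusted_score(80, suite_complete=False))  # (80, False)
```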

Of course it's not entirely clear how this works with living standards where the testsuite could get worse over time, although living standard + commitment to add tests with every spec change might be good enough.

@lukebjerring
Contributor Author

From @foolip on September 16, 2017 16:8

@bobholt, as the one who will likely implement and/or maintain this, do you have any thoughts about the metric itself, or implementation details? (It seems to me that web-platform-tests/results-collection#98 might impact this a little bit.)

@lukebjerring
Contributor Author

From @bobholt on September 21, 2017 11:39

I agree it's a hard problem. I sort of like the idea that 0/4 is okay and 4/4 is okay, but 2/4 is bad - developing a metric to show deviation from cross-platform consistency. Except that it may incentivize an early-adopter vendor to drop support for a relatively new and highly-desired-by-developers feature rather than wait for interop.

I agree that we think we kind of know what interop looks like, but we don't really know at a data level. Compounding this is that we can't be entirely sure at this point whether a failing test is due to a failing implementation, a bug in the test, a bug in the test runner, or a bug in the way the dashboard invokes the runner without going test-by-test to figure it out. That's why the work @rwaldron and @boazsender did on https://bocoup.github.io/wpt-error-report/ is valuable - it is exposing areas of the dashboard tests that broadly fail in the same way and are good candidates for further investigation.

But getting back to it, I think we need to agree on what interop looks like away from the data (all browsers implementing? all browsers not implementing? with or without feature flags? how do we measure interop of new features when we know they'll be incompletely implemented for a period? do we set a time limit on that period?)

@lukebjerring
Contributor Author

From @foolip on October 1, 2017 11:50

I'm preparing a presentation for https://webengineshackfest.org/ and as part of that I fiddled with devtools a bit to make a mockup of what a simple 4/3/2/1 browser-neutral view might look like:
[Screenshot of the mockup, 2017-10-01]

Colors need a lot of tweaking of course, and we might want a 0/4 column, but I think the above wouldn't be too bad.

@lukebjerring
Contributor Author

From @foolip on October 1, 2017 11:51

Maybe percentages would make this nicer still, but they'd mean very different things depending on the completeness of the test suites.

@lukebjerring
Contributor Author

From @foolip on November 8, 2017 9:43

I made another mockup for another presentation:
[Screenshot of the second mockup, 2017-11-08]

@lukebjerring
Contributor Author

From @mdittmer on December 13, 2017 13:40

Demo of proposed pass rate metrics is temporarily available at https://metrics5-dot-wptdashboard.appspot.com/metrics/

Feedback welcome! @foolip has already mentioned that maybe the order should be 4 / 4 down to 0 / 4. I would also like to add links to the equivalent results-based (rather than metrics-based) view somewhere. ATM, search in this view works a bit differently than in the results-based view. We should discuss what approach makes the most sense here. (Perhaps create a separate issue for that?)

@lukebjerring
Contributor Author

Some quick thoughts:

  • Color intensity should probably be proportionate to browser count, not the number of tests?
  • Total test count could be its own column (instead of [Passes] / [Total]) everywhere
  • In the filtered path views, instead of browser-count by test-path, aggregated metrics could be broken into a different two-dimensional grid: browser-count x browser
    • e.g. (Chrome is failing 4 of the 7 tests which pass in 3/4 browsers)


@lukebjerring
Contributor Author

From @mdittmer on December 13, 2017 14:4

@lukebjerring I'm having trouble parsing some aspects of your recommendations, but we can chat offline.

  - e.g. (Chrome is failing 4 of the 7 tests which pass in 3/4 browsers)

I believe that any browser-specific information was an explicit non-goal for this view. The idea is to assess general interop health independent of "who is passing, who is failing". Another view is coming soon that shows per-browser failing tests, ordered (ascending) by number of other browsers failing (i.e., start with tests where "this is the only browser failing this test").
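For illustration, a minimal sketch of that ordering, assuming results are available as a map from test to the set of browsers failing it (hypothetical data shape and names, not the upcoming view's implementation):

```python
def failures_for(browser, failing_browsers_by_test):
    """List this browser's failing tests with a count of other failing browsers."""
    mine = [(test, len(failing - {browser}))
            for test, failing in failing_browsers_by_test.items()
            if browser in failing]
    # Ascending: tests that only this browser fails come first.
    return sorted(mine, key=lambda item: item[1])

failing_browsers_by_test = {
    "a.html": {"chrome"},
    "b.html": {"chrome", "firefox"},
    "c.html": {"firefox", "safari", "edge"},
}
print(failures_for("chrome", failing_browsers_by_test))
# [('a.html', 0), ('b.html', 1)]
```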

@lukebjerring
Contributor Author

From @mdittmer on December 14, 2017 16:43

Just met with @foolip to discuss these comments and other thoughts.

The following changes will be applied to mdittmer/wptdashboard#3 (or earlier PR, in the case of back end changes) to improve this UI:

  • Column order: 4 / 4 down to 0 / 4
  • Sort files & folders lexicographically
  • No wrap in <th>s
  • Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table
  • Navigating to a directory will not reset the query string
  • Routes for results and metrics will be swapped:
    • Results will now live at /results/
    • Metrics will now live at /
    • Links at top will be updated accordingly

Still to sort out for mdittmer/wptdashboard#3:

  • Search strategies for results and metrics pages are different; unify them as much as possible without tanking performance

Future work on metrics (and results) web components:

  • Unified controller:
    • Sort and filter
    • (re)render table rows
    • Any "display mode" configuration (e.g., support displaying and/or sorting by percentage vs. number of passing tests)

@lukebjerring
Contributor Author

From @foolip on December 14, 2017 21:57

No wrap in <th>s

Or always wrap, whichever you think looks better.

Link to equivalent "results" page from a "metrics" page will appear right-aligned next to test directory path above table

Yep, and this probably needs to be a bit prominent.

@mdittmer
Contributor

mdittmer commented Feb 2, 2019

@lukebjerring is this still a priority?

@mdittmer mdittmer removed their assignment Feb 2, 2019
@lukebjerring
Contributor Author

Closing as redundant/fixed with the interop view.
