DZL Tracking Issue #6775
Alright folks, if you're like me, you've found DZL results not very useful. They're interesting, but not really actionable. Every time it fails something I assume it wasn't really the PR, and when it passes it doesn't really give me confidence that the PR is safe. Not a resounding success :/

Instead of adding features, I've spent the past (slow holiday) month or so of DZL work playing around with different URLs and intentionally regressive LH versions to suss out what's going on. Here's a general dump of what I've learned and what I think it means for LH.

tl;dr - we need to isolate and surface every aspect of variance and its sources

We must separate "good", "bad", and "unnecessary" variance

There's "good", "bad", and "unnecessary" variance.

- "Good variance" is variance due to changes in the user experience that were within the developer's control, i.e. resources changed and payloads got heavier, scripts got less efficient and took more time executing, etc.
- "Bad variance" is variance due to changes in the user experience that were mostly outside the developer's control, i.e. connection speed changes, random spikes in server traffic that slow responses, etc.
- "Unnecessary variance" is a change in our metrics that does not track any change in the user experience. I'm not currently aware of any such variance, but it's worth calling out and eliminating if we find it :)

Metric graphs without page vitals are meaningless

Every time I see a graph of an FCP that's different, I want to know why it's different. Were different resources loaded? Were the same resources just larger? Did our network analysis estimate different RTTs/server latency to the origins? Did the CPU tasks take longer? Why? Without this answer, I have no reason to believe the implementation changed or even had anything to do with the difference. To me, a page vital is anything that lantern uses in its estimates.
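To make that concrete, here is a rough sketch of what a per-run "page vitals" summary could look like. This is an illustration only; the summarizePageVitals helper and its field names are hypothetical, not an existing Lighthouse or DZL API.

```js
// Hypothetical per-run "page vitals" summary: the lantern inputs that explain
// *why* a metric moved, captured next to the metric itself.
function summarizePageVitals(networkRecords, cpuTasks) {
  const transferSizes = networkRecords.map(record => record.transferSize || 0);
  const serverLatencies = networkRecords.map(record =>
    record.timing ? Math.max(record.timing.receiveHeadersEnd - record.timing.sendEnd, 0) : 0
  );

  return {
    requestCount: networkRecords.length,
    totalTransferSize: transferSizes.reduce((sum, size) => sum + size, 0),
    medianServerLatency: median(serverLatencies), // crude stand-in for lantern's latency estimates
    totalCpuTaskDuration: cpuTasks.reduce((sum, task) => sum + task.duration, 0),
  };
}

function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted.length ? sorted[Math.floor(sorted.length / 2)] : 0;
}
```

Diffing a summary like this between two runs would tell us whether an FCP shift came with heavier payloads, slower origins, longer CPU tasks, or none of the above.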
Any variation in performance metrics basically comes down to one of these things varying, so measuring each of them is going to be critical. We can then define the success of our implementation as multiples of the variance of these underlying sources.

We are tracking too many metrics for p-values to be meaningful

This one we kinda knew going in, but I didn't realize how powerful it would actually be. In the standard set of DZL runs, we're looking at 10 URLs with 243 timings and 119 audits. That's far too many simultaneous comparisons for p-values to mean much; at conventional thresholds we should expect a pile of metrics to look "significantly" different purely by chance (see the back-of-the-envelope numbers after this comment). Decreasing the set of metrics we're observing for changes actually makes a lot of sense. If we think a few particular metrics are likely to change with a given PR, we can compare those few and the mechanics of p-value testing start to hold once again. We can still monitor other metrics for curiosity and exploration, but we shouldn't read into any of them changing.

We need few metrics, identical environments, and stable URLs to PR-gate LH

All of the above has strong implications for gating PRs. We'll need to select a narrow set of criteria that we're concerned about and ensure that all page vitals are similar before failing a PR. This generally means identical environments and stable URLs whose code we control, or that at least changes infrequently with little non-determinism.

Analyzing/tracking real-world variance and getting an effective LH regression-testing system are two very different problems

When we look at all the variance we need to eliminate before concluding that a difference was the fault of a code change, we're eliminating lots of things that will be encountered in the real world. These are two very different use cases, and I think we need to accept the fact that it will take different strategies and potentially even different tools to solve both of these problems.

I think this is all I've got for now, but I've updated this issue with some of the action items I've mentioned here and we can discuss more at the inaugural variance meeting :)
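To put rough numbers on the "too many metrics" point above, here is a back-of-the-envelope illustration. It assumes one independent test per URL × metric pair and a conventional alpha of 0.05; neither is an actual DZL setting.

```js
// Back-of-the-envelope: how many metrics would look "significantly" different
// purely by chance if nothing actually changed between two runs?
const urls = 10;
const timings = 243;
const audits = 119;
const alpha = 0.05; // assumed per-test significance threshold

const totalTests = urls * (timings + audits);          // 3,620 comparisons
const expectedFalsePositives = totalTests * alpha;     // ~181 spurious "regressions"
const pAnyFalsePositive = 1 - Math.pow(1 - alpha, totalTests); // effectively 1

console.log({totalTests, expectedFalsePositives, pAnyFalsePositive});

// A Bonferroni-style correction (alpha / totalTests) or simply shrinking the
// set of metrics under test restores the meaning of a per-metric p-value.
```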
In our last variance meeting we discussed the future of DZL and things we want it to do. @brendankenny said...
Thoughts: Will have to explore more what easily accessible DZL results should look like :) We currently have the results link commented on any PR that has DZL enabled, which isn't the most elegant, but this may have been referring more to the consumability of what lies at the end of the link. @paulirish said...
Thoughts: IMO, this is actually the only thing that DZL does well at the moment :) The issue for PRs is that we test on such a small basket of sites that my confidence level is not very high. The obvious solution to me here is to radically increase our basket of sites. It doesn't seem like the speed with which DZL returns results has been the main problem so far, and I'd obviously rather they be useful but slow than fast but unhelpful.
Thoughts: This and the previous request by Paul are what I see as the most promising for DZL's future. DZL was super helpful for reproing and finding URLs for the m74 series of issues. It could easily be modified to continuously test Chromium versions against our master and alert on error rate and performance changes.
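If we went that direction, a minimal sketch of such a watcher might look like the following. Everything here is hypothetical: the URL basket, the thresholds, and the runLighthouse/alert helpers are placeholders, not existing DZL code.

```js
// Hypothetical watcher: run Lighthouse master against a new Chromium build on a
// fixed URL basket, then compare error rate and a key metric against a baseline.
const URLS = ['https://example.com/', 'https://example.org/']; // placeholder basket

async function checkChromiumBuild(runLighthouse, baseline, alert) {
  let errors = 0;
  const fcps = [];

  for (const url of URLS) {
    try {
      const lhr = await runLighthouse(url); // assumed to launch the Chromium build under test
      fcps.push(lhr.audits['first-contentful-paint'].numericValue);
    } catch (err) {
      errors++;
    }
  }

  const errorRate = errors / URLS.length;
  const medianFcp = fcps.length
    ? fcps.sort((a, b) => a - b)[Math.floor(fcps.length / 2)]
    : NaN;

  // Placeholder thresholds; real ones would be derived from observed run-to-run variance.
  if (errorRate > baseline.errorRate + 0.05) alert(`Error rate jumped to ${errorRate}`);
  if (medianFcp > baseline.medianFcp * 1.1) alert(`Median FCP regressed to ${medianFcp}ms`);

  return {errorRate, medianFcp};
}
```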
Thoughts: IMO, this is a very similar goal to Lighthouse CI, so if this is one of our focuses I'd want to invest my time in CI, where it'll benefit all Lighthouse users rather than something specific to our internal infra.
I mostly meant auto running on every PR (if that's feasible) and having a big old link when it's done :) Not sure if we'd also want some kind of status posted, or if clicking through is sufficient. I feel like none of us ever clicks through on our statuses unless something is broken or we're expecting something interesting, like the deploy links for PRs changing the report.
👍 👍
I'm not sure why this is still open :) We have halted most effort here for over a year. We have some scripts, like the GCP lantern collection, that could help revive one-off comparisons as we need them, but as a general CI practice I think this is abandoned.
See #6152 for historical discussion and DZL's purpose. Comment or edit this issue to add feature requests.
Features
Bugs
?