
DZL Proposal and Variance Measurement #6152

Closed

patrickhulce opened this issue Oct 1, 2018 · 19 comments

@patrickhulce (Collaborator) commented Oct 1, 2018

I've been doing some thinking and background research on what's out there. I'm taking down my thoughts here as a sounding board so we can narrow down to an MVP.

The Problem

The LH team needs to be able to understand metric variance and overall runtime/latency (bonus if possible: accuracy) in different environments, how the changes we make affect these attributes, and how we are trending over time.

Recap

Need to monitor:

  1. Metric variance
  2. Overall LH runtime
  3. (maybe if possible) Accuracy to real phones

Across:

  1. Environment (i.e. LR vs. local vs. travis/cloud/whatever)
  2. Different site types/URLs (i.e. example.com-type vs. cnn.com-type)
  3. Throttling types (i.e. Lantern vs. DevTools vs. WPT vs. none)

Use Cases:

  1. Overall "health dashboard" i.e. what does master look like overall?
  2. Compare version A to version B i.e. does this change improve LH?
  3. Timeline view by commit i.e. are we suffering from a death by a thousand cuts over time?

Potential Solution Components

  1. Mechanism for running LH n times in a particular environment on given URLs and storing the results in some queryable format
  2. Mechanism for visualizing all the latest master results
  3. Mechanism for visualizing the difference between two different versions of LH
  4. Mechanism for visualizing the history of master results

Existing Solutions

The good news: we have an awesome community that has built lots of things to look at LH results over time :)
The bad news: their big selling points usually revolve around ease of time series data and abstracting away the environment concerns (which is the one piece we will actually need to change up and have control over the most) :/

Only one of the use cases here is really a timeseries problem (and even then it's not a real-time timeseries, it's a commit-level timeseries). That's not to say we can't repurpose a timeseries DB for our use cases (Grafana still supports histograms and all that), it's just a bit of a shoehorn for some of the things we'll want to do.

The other problem: one of the things we actually care about most in all of this is differences between versions of Lighthouse. Given that abstracting the environment away and keeping it stable is a selling point of all these solutions, breaking in to make comparing versions our priority is really cutting against the grain. Again, not impossible, but not exactly leveraging the strengths of these solutions.

Proposed MVP

K-I-S-S, keep it simple stupid. Great advice, hurts my feelings every time.

Simple CLI with 2 commands.

  1. run - handles the "run n times and save" piece; a single JS file for each connector we need to run, just local and LR to start
  2. serve - serve a site that enables the visualization pieces

These two commands share a CLI config that specifies the storage location. I'm thinking SQLite to start, to avoid any crazy Docker mess while still working with some hypothetical remote SQL server later. We can include a field for the raw response so we can always add more columns easily later.
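
To make the storage piece concrete, here's a minimal sketch of what the run table could look like, assuming the better-sqlite3 package and purely hypothetical table/column names (not a final schema):

```js
// Minimal sketch only: assumes better-sqlite3; table/column names are hypothetical.
const Database = require('better-sqlite3');

const db = new Database('dzl.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    batch_id TEXT,     -- groups the n runs kicked off together
    git_hash TEXT,     -- LH version under test
    environment TEXT,  -- e.g. 'local' or 'LR'
    url TEXT,
    runtime_ms REAL,   -- a few denormalized metrics for easy querying
    tti_ms REAL,
    raw_response TEXT  -- full JSON blob so more columns can be added later
  )
`);

// Each connector (local, LR, ...) would insert one row per Lighthouse run.
const insertRun = db.prepare(`
  INSERT INTO runs (batch_id, git_hash, environment, url, runtime_ms, tti_ms, raw_response)
  VALUES (@batchId, @gitHash, @environment, @url, @runtimeMs, @ttiMs, @raw)
`);
```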

Thoughts so far? Did I completely miss what the pain is from others' perspective? Does it sound terrifyingly similar to plots? 😱

@patrickhulce (Collaborator, Author)

I didn't explicitly make my case for why I think going with a time series database won't help us very much, and I realize I didn't put up any of my scribbled drawings either, so here's a cleaned-up version of what I imagined.

For our "health dashboard" at a minimum I'm thinking we need to see

  • Overall variance on a basket of pages
  • Overall run time on a basket of pages
  • Variance in conditions as controlled as possible
  • Variance on a thornier live page
  • Run time on a typical page
  • Run time on a thornier live page

None of these are time series questions, and all involve computing mean/median/std dev on a collection of runs matching some git hash/version ID/run ID.

We want to be able to jump around hashes and easily compare the same version of LH in multiple environments (which is non-trivial for LR vs. CLI since their hashes will very rarely line up and we'll need to do some nearest-git-neighbor matching).
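
As a rough illustration of the kind of per-hash aggregation I mean (a plain JS sketch, not DZL code; the run object shape is an assumption):

```js
// Sketch: summarize a metric for all runs sharing a git hash.
// Assumes `runs` is an array of {gitHash, url, tti} objects pulled from storage.
function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mean = values.reduce((sum, x) => sum + x, 0) / values.length;
  const median = sorted[Math.floor(sorted.length / 2)];
  const variance = values.reduce((sum, x) => sum + (x - mean) ** 2, 0) / values.length;
  return {mean, median, stdDev: Math.sqrt(variance)};
}

function summarizeByHash(runs, metric = 'tti') {
  const byHash = new Map();
  for (const run of runs) {
    if (!byHash.has(run.gitHash)) byHash.set(run.gitHash, []);
    byHash.get(run.gitHash).push(run[metric]);
  }
  return new Map([...byHash].map(([hash, values]) => [hash, summarize(values)]));
}
```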

All this gave me the idea of something like the below.

[drawing: DZL proposal sketch]

Bad drawings aside, does the bulleted list sound way off to other folks?

@paulirish (Member)

I like everything here in this second comment and agree these will help to illuminate things.

But I also think variance and runtime plotted over our commits would be really valuable so we can see how our fixes improve things. E.g. "Our median run time was 30s last month and now it's 14s, as you can see. And our 90th-percentile run time dropped even further, mostly due to this commit."

@patrickhulce (Collaborator, Author) commented Oct 5, 2018

But I also think variance and runtime plotted over our commits would be really valuable so we can see how our fixes improve things. E.g. "Our median run time was 30s last month and now it's 14s, as you can see. And our 90th-percentile run time dropped even further, mostly due to this commit."

totally agree plotted over commits would be great 👍

This is likely just my timeseries DB incompetence, but it always appeared to me that there's no 'bucket by hash in custom sorted order' style chart that can reuse all the timeseries magic; it always wants timestamps. By that token, I think a graph by time will lose a decent bit of signal that a graph by commit would show. I will keep digging though.

@patrickhulce (Collaborator, Author)

Update here based on my experiences trying to get this to work with InfluxDB + Grafana:

It gets us 80% of the way there really quickly. There's definitely a substantial amount of fighting with the timeseries nature going on. Most of it has workarounds with extra management/planning, but I think it's unlikely we can satisfy 100% of the use cases outlined in the initial comment. Maybe that's fine and we're happy with the trade-off, but I'll outline what I found the main challenges to be.

Query choices are limited

Example: there is no GROUP BY x HAVING y-style clause, which would have helped in a few cases. This isn't a huge deal given we plan ahead and are flexible with how we look at the output. I was able to work around this limitation by preprocessing the points from each URL in a run and using a table for output instead of a single cell.

Points are downsampled to a particular timestamp

Workaround: we fake timestamps for each datapoint so they're unique. The limitation is super minor; we just can't run different jobs concurrently or we risk losing data.
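
Roughly what I mean by faking timestamps (an illustrative sketch, not the actual code); since InfluxDB keys points by measurement + tags + timestamp, duplicate timestamps within a batch would overwrite each other:

```js
// Sketch: space each datapoint 1ms apart so no two points in a batch share a timestamp.
function fakeTimestamps(points, batchStartMs) {
  return points.map((point, index) => ({
    ...point,
    timestamp: new Date(batchStartMs + index),
  }));
}
```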

Histograms don't want to represent every data point, and there's no unit control

This one feels like I have to be doing something wrong. Even though I've removed every group by and time-related grouping option I can find, histograms still don't always want to represent every data point. Strangest part: changing the # of buckets changes the total count every time (?????)

"Time series" by build is clearly not a first-class visualization option

I found a workaround to treat each build like a next data point in a time series, but the visualization options become very limited using this approach and will not scale with many different builds. Example of limitation: it's bar chart only, no line graphs or variance bars, cannot control the units used, etc. Overall it feels like we'd give up on this and just use an uneven time series.

Overall I think we can live with these or build enough tools around them to limit their impact, but it does mean a lot of use cases will become more difficult, i.e. comparing two different builds will always be manually selecting their respective hashes and presenting their reports side-by-side instead of some sort of custom in-line diff. Previously folks seemed to think these limitations were worth the tradeoff.

WDYT?

@wardpeet (Collaborator) commented Oct 9, 2018

Grafana is sadly great for timeseries but not for other data. Setting it up and making dashboards is really easy 💯, but maybe it's not the stack we need?

If we don't have timeseries data, we might want to look at the ELK stack (Elasticsearch + Kibana)? I'm pretty sure it supports line graphs with a custom x-axis. The downside is that you need Elasticsearch as the data source, so we can't use SQLite or whatever. (http://logz.io has a free trial we might want to use for testing.)

Another way is to create some custom graphs ourselves but that defeats the purpose of K-I-S-S and is something we don't really want to manage.

@patrickhulce (Collaborator, Author) commented Oct 9, 2018

One more histogram WTF: here's a side-by-side of changing absolutely nothing but the number of buckets on the X-axis.

5 buckets

[screenshot: histogram with 5 buckets]

10 buckets (impossible for there to be only 1 data point below 0)

[screenshot: histogram with 10 buckets]

@paulirish (Member)

OK. I think I'm sufficiently convinced that these tools optimize for metrics happening continuously and requiring grouping. Our use case of discrete metrics every hour or day isn't supported well by them (your 3rd bold point).

I appreciate the attempt to make it work, but agree that grafana isn't a great solution for what we're trying to do here.

@wardpeet (Collaborator)

The ELK stack is not a big difference. I could set it up on a private instance and give you guys access to import some data and make the correct graphs. Then you wouldn't have to be bothered with setting up the stack.

@patrickhulce (Collaborator, Author)

OK, so I think we're all on the same page about timeseries solutions not being the best for us. Before jumping into the next attempt, I want to get some clarity on what exactly we want to show up in the dashboard.

This is what the Grafana one looked like:
[screenshot: Grafana dashboard]

It surfaced:

  • Mean run time over time
  • Histogram of all run time samples
  • Mean run time of all samples
  • 95th percentile run time of all samples
  • Mean run time for a particular good URL (example.com)
  • Mean run time for a particular bad URL (sfgate.com)
  • Table of mean run time by URL
  • Mean TTI std dev over time
  • Histogram of all TTI mean difference samples
  • Mean FCP std dev
  • Mean TTI std dev
  • Mean SI std dev
  • 95th percentile TTI mean difference of all samples
  • TTI std dev for a particular good URL (example.com)
  • TTI std dev for a particular bad URL (theverge.com)
  • Table of TTI std dev by URL

@brendankenny you mentioned you had different metrics in mind
@exterkamp you mentioned you really wanted a line graph for the over-time view

Any other feedback on this set of metrics before I go off?

@exterkamp (Member) commented Oct 12, 2018

Hey, so yeah, I was thinking that I want to see data commit-over-commit so that I could see if a specific commit has introduced a problem, or that variance has been reduced.

I am still liking the idea of a candlestick graph with the data like this:
[image: example candlestick chart]

  • x-axis: commit hash identifying the point in time of the run; maybe LH version would be better, or master commit hash, or varying resolution might be nice
  • each candle would be derived from all runs for that commit hash. I imagine building the candle by taking all the runs, finding 1 std dev from the mean, and plotting that:
    • "open": mean minus 1 std dev
    • "close": mean plus 1 std dev
    • "high": the highest value of the metric for any of the runs in that commit
    • "low": the lowest value of the metric for any of the runs in that commit

This would allow us to visualize when the variance was narrowing i.e. the std dev would be going down over time:
[image: candlesticks with narrowing variance over time]

Or if a specific commit increased variance:
[image: candlestick showing a commit that increased variance]

So that is kind of how I like to visualize the scores over time, either with a candlestick chart, or with a line chart + shaded area of +/- 1-2 std dev around it to show the variance.
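
Deriving a candle from the definitions above could look something like this (a hypothetical helper sketch, independent of any charting library):

```js
// Sketch: build one candle per commit from the raw metric values for that commit's runs.
function buildCandle(commitHash, values) {
  const mean = values.reduce((sum, x) => sum + x, 0) / values.length;
  const stdDev = Math.sqrt(values.reduce((sum, x) => sum + (x - mean) ** 2, 0) / values.length);
  return {
    commitHash,
    open: mean - stdDev,       // mean minus 1 std dev
    close: mean + stdDev,      // mean plus 1 std dev
    high: Math.max(...values), // highest value of the metric for any run at this commit
    low: Math.min(...values),  // lowest value of the metric for any run at this commit
  };
}
```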


I like the current visualizations esp. broken down by URL. But personally I want to see line charts/candlestick charts that show me what each metric is doing over time so that I can see if something is getting out of hand over time or degrading slowly. But for snapshots I like all the called out percentages and variance boxes coded red/yellow/green.

Hey @patrickhulce, have you looked into Superset?
(Disclaimer: I used to work on Airflow DAG ingestion and viz on a tool that used Superset, so I like it, and it's Python.)


Made some candles with some of the dumped data to show what variance in run duration could look like over multiple commits in candle form.
[image: candlestick chart of run duration across commits]

@patrickhulce (Collaborator, Author) commented Oct 19, 2018

I totally dig candlesticks (though I must say this name is new to me, is it different in any particular way from a box plot or are they the same thing?). This is also how I imagined the visualization of data over time 👍

I've spent some time with Superset now, and I think it might be overkill for our use case. The basic installation sets up multiple DBs with a message broker and a stateless Python server. We are like the farthest thing from big data and its super-scalable selling points :)

Perhaps it's the incubator status, or the docs don't have the same love as the impl, or I'm just finding all the wrong docs 😆, but I ran into several roadblocks where the setup docs led to an error (even their super simple Docker ones had the wrong commands 😕) and I had to peruse the source to fix things and move on. After struggling with still-broken dashboards after setting everything else up, I took a whack at a bespoke frontend.

In the same time it took to build the docker-compose setup and troubleshoot the dashboard, this is what I got goin':

[screenshot: bespoke DZL frontend]

It can be deployed as two static files to create a new now.sh URL for every PR or whatever, or we can build out a fancier setup with customizable queries and whatnot if need be. Do we envision needing to create lots of exploratory new queries and dashboards such that the nice GUI dashboard creator of Superset would be worth the complexity?

@patrickhulce (Collaborator, Author)

If you guys are swamped and would prefer I just move forward with what I think will help the most and hope it's close enough to what you'd ideally want, you can say that too :)

@paulirish (Member) commented Oct 25, 2018

@patrickhulce you got this. Plenty of things to bikeshed, but I think we're aligned on the general approach. +1 to moving forward.

@patrickhulce (Collaborator, Author)

So I've been trying to use DZL on a few PRs, and I've run into a few issues that really hamper its usability so far...

Using real sites is much more problematic than expected, even with 25 runs as a corpus

This is mostly solved by creating complex static sites we can measure; it's mainly CNN/news sites that change images, ads, etc. within hours. Here's an example of SFGate TTI with 0 Lighthouse changes:
[screenshot: SFGate TTI variance with no Lighthouse changes]

The small differences between cloud machines (even of the same type) on static, localhost sites can still generate statistically significant changes in metrics.

I don't have a great answer for this one. Maybe we try to take a better baseline of each machine? Here's an example with 0 FCP-affecting changes on the localhost fonts smoketest:

[screenshot: FCP variance across machines on the localhost fonts smoketest]

I find myself wanting an on-demand "compare this URL on these two hashes on the same machine, GO"

This one has an easy answer, I just need to build this! Just didn't expect I'd want it so quickly :)

@patrickhulce (Collaborator, Author)

@hoten I'd be really eager to hear what you didn't like about the first DZL experience.

What did you want to see? Did you have no clue where to start? etc :)

@connorjclark (Collaborator) commented Dec 5, 2018

Do the results stream in? When I first looked at DZL for #6730, there were only 2 sites on the hash-to-hash page. But I see all of them now. At first I figured I broke something.

Since the above PR introduced a new audit, there's not a graph for the variance of that audit's score. Would be nice to still show it, even with nothing to compare it to.

@patrickhulce (Collaborator, Author)

Since the above PR introduced a new audit, there's not a graph for the variance of that audit's score. Would be nice to still show it, even with nothing to compare it to.

Great point 👍

@patrickhulce (Collaborator, Author)

When I first looked at DZL for #6730, there was only 2 sites in the hash-to-hash page.

I need to fix the automatic hash selection. It automatically picks the most recent batch, but sometimes that's a batch that just started instead of the most recent one that's done. You can try one of the older official-ci batches to find one that has all the URLs if that happens again.
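
A possible fix, sketched with hypothetical table/column names: pick the most recent batch that already has every expected URL rather than the most recently started one.

```js
// Sketch (assuming a SQLite `runs` table with batch_id/url columns): latest complete batch.
function latestCompleteBatch(db, expectedUrlCount) {
  return db.prepare(`
    SELECT batch_id
    FROM runs
    GROUP BY batch_id
    HAVING COUNT(DISTINCT url) >= ?
    ORDER BY MAX(id) DESC
    LIMIT 1
  `).get(expectedUrlCount);
}
```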

@patrickhulce (Collaborator, Author)

Closing this in favor of a new tracking issue #6775
