DZL Proposal and Variance Measurement #6152
I didn't explicitly make my case for why I think going with a time series database won't help us very much, and I realize I didn't put up any of my scribbled drawings either, so here goes a cleaned-up version of what I imagined. For our "health dashboard", at a minimum I'm thinking we need to see:
None of these are time series questions, and all involve doing mean/median/std. dev on a collection of runs matching some git hash/version ID/run ID. We want to be able to jump around hashes and easily compare the same version of LH in multiple environments (which is non-trivial when comparing LR to CLI, since they will very rarely line up and we'd need to do some nearest-git-neighbor junk). All this gave me the idea of something like the below. Bad drawings aside, does the bulleted list sound way off to other folks?
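To make the "mean/median/std. dev on a collection of runs matching some git hash" idea concrete, here's a rough sketch in plain Node; the `{hash, metric, value}` record shape is just an assumption for illustration, not what we'd necessarily store:

```js
// Rough sketch: group runs by git hash + metric and compute mean/median/std dev.
// Assumes runs are stored as {hash, metric, value} records (illustrative shape only).
function aggregateByHash(runs) {
  const groups = new Map();
  for (const {hash, metric, value} of runs) {
    const key = `${hash}::${metric}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  return [...groups].map(([key, values]) => {
    const [hash, metric] = key.split('::');
    const mean = values.reduce((sum, x) => sum + x, 0) / values.length;
    const sorted = [...values].sort((a, b) => a - b);
    const median = sorted[Math.floor(sorted.length / 2)];
    const stddev = Math.sqrt(
      values.reduce((sum, x) => sum + (x - mean) ** 2, 0) / values.length
    );
    return {hash, metric, n: values.length, mean, median, stddev};
  });
}
```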
I like everything here in this second comment and agree these will help to illuminate things. But I also think variance and runtime plotted over our commits would be really valuable so we can see how our fixes improve things. E.g. "Our median run time was 30s last month and now it's 14s, as you can see, and our 90th-percentile run time dropped even further, mostly due to this commit."
Totally agree, plotted over commits would be great 👍 This is likely just my timeseries-db incompetence, but it always appeared to me that there's no 'bucket by hash in custom sorted order' style chart that can reuse all the timeseries magic; it always wants timestamps. By that token, I think a graph by time will lose a decent bit of signal that a graph by commit would show. I will keep digging though.
Update here based on my experiences trying to get this to work with InfluxDB + Grafana: it gets us 80% of the way there really quickly. There's definitely a substantial amount of fighting with the timeseries nature going on. Most of it has workarounds with extra management/planning, but I think it's unlikely we can satisfy 100% of the use cases outlined in the initial comment. Maybe that's fine and we're happy with the trade-off, but I'll outline what I found the main challenges to be.

**Query choices are limited**
Example: there is no …

**Points are downsampled to a particular timestamp**
Workaround: we fake timestamps for each datapoint so they're unique. The limitation is super minor, just no running different jobs concurrently or we risk losing data.

**Histograms don't want to represent every data point, and there's no unit control**
This one feels like I have to be doing something wrong. Even though I've removed every group-by and time-related grouping option I can find, histograms still don't always want to represent every data point. Strangest part: changing the # of buckets changes the total count every time (?????)

**"Time series" by build is clearly not a first-class visualization option**
I found a workaround to treat each build like the next data point in a time series, but the visualization options become very limited using this approach and will not scale with many different builds. Examples of the limitations: it's bar chart only, no line graphs or variance bars, cannot control the units used, etc. Overall it feels like we'd give up on this and just use an uneven time series.

Overall I think we can live with these or build enough tools around them to limit their impact, but it does mean a lot of use cases will become more difficult, i.e. comparing two different builds will always be manually selecting their respective hashes and presenting their reports side-by-side instead of some sort of custom in-line diff. Previously folks seemed to think these limitations were worth the tradeoff. WDYT?
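For reference, the fake-timestamp workaround is roughly this shape (a sketch using the node-influx client; the measurement/tag/field names are made up for illustration):

```js
// Sketch of the fake-timestamp workaround with node-influx.
const Influx = require('influx');

const influx = new Influx.InfluxDB({host: 'localhost', database: 'dzl'});

async function writeBatch(hash, values) {
  const base = Date.now();
  // Points sharing a timestamp + tag set overwrite each other, so offset each
  // datapoint by 1ms to keep every point in the batch unique.
  const points = values.map((value, i) => ({
    measurement: 'lh_metric',
    tags: {hash},
    fields: {value},
    timestamp: new Date(base + i),
  }));
  await influx.writePoints(points);
}
```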
Grafana is sadly great for timeseries but not for other data. Setting it up and making dashboards is really easy 💯, but maybe it's not the stack we need? If we don't have timeseries data we might want to look at the ELK stack (Elastic + Kibana)? I'm pretty sure they support line graphs with a custom x-axis. The downside is that you need Elastic as a data source, so we can't have sqlite or whatever. (http://logz.io has a free trial we might want to use to test.) Another option is to create some custom graphs ourselves, but that defeats the purpose of K-I-S-S and is something we don't really want to manage.
OK. I think I'm sufficiently convinced that these tools optimize for metrics arriving continuously and requiring grouping. Our use case of discrete metrics every hour or day isn't supported well by them (your 3rd bold point). I appreciate the attempt to make it work, but I agree that Grafana isn't a great solution for what we're trying to do here.
The ELK stack is not a big difference. I could set it up on a private instance and give you guys access to import some data and make the correct graphs. Then you don't have to be bothered with setting up the stack.
OK, so I think we're all on the same page about timeseries solutions not being the best for us. Before jumping into the next attempt, I want to get some clarity on what exactly we want to show in the dashboard. This is what the Grafana one looked like and what it surfaced:
@brendankenny you mentioned you had different metrics in mind. Any other feedback on this set of metrics before I go off?
Hey, so yeah, I was thinking that I want to see data commit-over-commit, so that I could see if a specific commit has introduced a problem, or so that we can see that variance has been reduced. I'm still liking the idea of a candlestick graph with data like this:
This would allow us to visualize when the variance is narrowing, i.e. the std dev going down over time, or when a specific commit increased variance.

So that is kind of how I like to visualize the scores over time: either a candlestick chart, or a line chart with a shaded area of +/- 1-2 std dev around it to show the variance. I like the current visualizations, esp. broken down by URL. But personally I want to see line charts/candlestick charts that show me what each metric is doing over time, so I can see if something is getting out of hand or degrading slowly. For snapshots I like all the called-out percentages and the variance boxes coded red/yellow/green.

Hey @patrickhulce, have you looked into Superset?

I made some candles with some of the dumped data to show what variance in the duration of a run could look like over multiple commits in candle form.
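For concreteness, per-commit candle values could be derived from raw run durations roughly like this (a sketch; the percentile choices for the box and wicks are just one reasonable option, not necessarily what the charts above used):

```js
// Sketch: derive per-commit candlestick values from raw run durations.
function percentile(sorted, p) {
  const idx = Math.min(sorted.length - 1, Math.floor(p * sorted.length));
  return sorted[idx];
}

function toCandle(hash, durations) {
  const sorted = [...durations].sort((a, b) => a - b);
  return {
    hash,
    low: percentile(sorted, 0.10),    // lower wick
    open: percentile(sorted, 0.25),   // bottom of the box
    median: percentile(sorted, 0.50),
    close: percentile(sorted, 0.75),  // top of the box
    high: percentile(sorted, 0.90),   // upper wick
  };
}

// e.g. commits.map(c => toCandle(c.hash, c.durations)) feeds the chart series
```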
I totally dig candlesticks (though I must say the name is new to me; is it different in any particular way from a box plot, or are they the same thing?). This is also how I imagined the visualization of data over time 👍

I've spent some time with Superset now, and I think it might be overkill for our use case. The basic installation sets up multiple DBs with a message broker and a stateless Python server. We are like the farthest thing from big data and its super-scalable selling points :) Perhaps it's the incubator status, or the docs don't have the same love as the impl, or I'm just finding all the wrong docs 😆, but I ran into several roadblocks where the setup docs led to an error (even their super-simple docker ones had the wrong commands 😕) and I had to peruse the source to fix things and move on.

After struggling with still-broken dashboards once everything else was set up, I took a whack at a bespoke frontend. In the same time it took to build the docker-compose setup and troubleshoot the dashboard, this is what I got goin'. It can be deployed in two static files to create a new …
If you guys are swamped and would prefer I just move forward with what I think will help the most and hope it's close enough to what you'd ideally want, you can say that too :)
@patrickhulce you got this. plenty of things to bikeshed but i think we're aligned on the general approach. +1 to moving forward.
@hoten I'd be really eager to hear what you didn't like about the first DZL experience. What did you want to see? Did you have no clue where to start? etc :)
Do the results stream in? When I first looked at DZL for #6730, there were only 2 sites on the hash-to-hash page. But I see all of them now. At first I figured I broke something. Since the above PR introduced a new audit, there's no graph for the variance of that audit's score. It would be nice to still show it, even with nothing to compare it to.
Great point 👍
I need to fix the automatic hash selection. It automatically picks the most recent batch, but sometimes that's a batch that just started instead of the most recent one that's done. You can try one of the older …
Closing this in favor of a new tracking issue #6775
I've been doing some thinking and background research on what's out there. I'm taking down my thoughts here as a sounding board so we can narrow down to an MVP.
The Problem
The LH team needs to be able to understand metric variance and overall runtime/latency (bonus if possible: accuracy) in different environments, how changes we make are affecting these attributes, and how we are trending over time.
Recap
Need to monitor:
Across:
Use Cases:
Potential Solution Components
- Running LH `n` times in a particular environment on given URLs and storing the results in some queryable format
- Comparing a set of runs against `master` results
- Tracking `master` results over time

Existing Solutions
The good news: we have an awesome community that has built lots of things to look at LH results over time :)
The bad news: their big selling points usually revolve around ease of handling time series data and abstracting away the environment concerns (which is the one piece we will actually need to change up and have control over the most) :/

Only one of the use cases here is really a timeseries problem (and even then it's not a real-time timeseries, it's a commit-level timeseries). That's not to say we can't repurpose a timeseries DB for our use cases, Grafana still supports histograms and all that, it's just a bit of a shoehorn for some of the things we'll want to do.

The other problem: one of the big things we actually care about most in all of this is differences between versions of Lighthouse. Given that abstracting the environment away and keeping it stable is a selling point of all these solutions, breaking in to make comparing versions our priority really cuts against the grain. Again, not impossible, but not exactly leveraging the strengths of these solutions.
Proposed MVP
K-I-S-S, keep it simple stupid. Great advice, hurts my feelings every time.
Simple CLI with 2 commands.
- `run` - handles the "run `n` times and save" piece; a single js file for each connector we need to run, just local and LR to start
- `serve` - serves a site that enables the visualization pieces

These two commands share a CLI config that specifies the storage location. I'm thinking sqlite to start to avoid any crazy docker mess and to work with some hypothetical remote SQL server. We can include a field for the raw response so we can always add more columns easily later.
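As a rough sketch of what the storage piece could look like (assuming `better-sqlite3` and made-up column names, not a final schema):

```js
// Sketch of the storage piece: one row per observed metric value, plus the raw
// LHR JSON so we can derive new columns later. Column names are illustrative.
const Database = require('better-sqlite3');

const db = new Database('dzl.sqlite');
db.exec(`
  CREATE TABLE IF NOT EXISTS runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    hash TEXT NOT NULL,         -- git hash / version ID
    environment TEXT NOT NULL,  -- e.g. 'local-cli' or 'lr'
    url TEXT NOT NULL,
    metric TEXT NOT NULL,
    value REAL NOT NULL,
    raw_lhr TEXT                -- full JSON response for future columns
  )
`);

const insert = db.prepare(`
  INSERT INTO runs (hash, environment, url, metric, value, raw_lhr)
  VALUES (@hash, @environment, @url, @metric, @value, @raw_lhr)
`);

function saveResult({hash, environment, url, metric, value, raw_lhr}) {
  insert.run({hash, environment, url, metric, value, raw_lhr: JSON.stringify(raw_lhr ?? null)});
}
```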
Thoughts so far? Did I completely miss what the pain is from others' perspective? Does it sound terrifyingly similar to `plots`? 😱