
Address slowness resulting from size of MODE data #67

Open
mollybsmith-noaa opened this issue Feb 10, 2022 · 120 comments
Assignees: mollybsmith-noaa
Labels: priority: high (High Priority) · requestor: NOAA/GSL (NOAA Global Systems Laboratory) · type: task (An actionable item of work)

Comments

@mollybsmith-noaa
Collaborator

The MODE output from MET is quite large and may cause problems if we run it in real time and try to store all of it in one database. We need to discuss how best to handle this.

mollybsmith-noaa added the priority: high, type: task, and requestor: NOAA/GSL labels on Feb 10, 2022
mollybsmith-noaa self-assigned this on Feb 10, 2022
mollybsmith-noaa changed the title from "Discuss how best to deal with size of MODE data" to "Address slowness resulting from size of MODE data" on Mar 29, 2022
@mollybsmith-noaa
Collaborator Author

To document the slowness of MySQL queries for MODE data, I plotted our most taxing test case: a month-long reflectivity dieoff. These are the parameters:

[Screenshot of the plot parameter selections, taken 2022-03-25]

This plot requires two queries (to avoid doing a horrendous table join):

select h.fcst_lead as fcst_lead,
       count(distinct unix_timestamp(h.fcst_valid)) as N_times,
       min(unix_timestamp(h.fcst_valid)) as min_secs,
       max(unix_timestamp(h.fcst_valid)) as max_secs,
       avg(ld2.area) as area,
       group_concat(distinct ld2.object_id, ';', h.mode_header_id, ';', ld2.area, ';', ld2.intensity_nn, ';',
                    ld2.centroid_lat, ';', ld2.centroid_lon, ';', unix_timestamp(h.fcst_valid), ';', h.fcst_lev
                    order by unix_timestamp(h.fcst_valid), h.fcst_lev) as sub_data2
from mv_gsl_mode_retros.mode_header h,
     mv_gsl_mode_retros.mode_obj_single ld2
where 1=1
  and unix_timestamp(h.fcst_valid) >= '1588483800'
  and unix_timestamp(h.fcst_valid) <= '1591023600'
  and h.model = 'HRRRv4'
  and h.fcst_var = 'REFC'
  and h.fcst_lev IN ('L0','L0=')
  and h.descr IN ('ECONUS')
  and ld2.simple_flag = 1
  and h.mode_header_id = ld2.mode_header_id
group by fcst_lead
order by fcst_lead;

select h.fcst_lead as fcst_lead,
       count(distinct unix_timestamp(h.fcst_valid)) as N_times,
       min(unix_timestamp(h.fcst_valid)) as min_secs,
       max(unix_timestamp(h.fcst_valid)) as max_secs,
       avg(ld.interest) as interest,
       group_concat(distinct ld.interest, ';', ld.object_id, ';', h.mode_header_id, ';', ld.centroid_dist, ';',
                    unix_timestamp(h.fcst_valid), ';', h.fcst_lev
                    order by unix_timestamp(h.fcst_valid), h.fcst_lev) as sub_data
from mv_gsl_mode_retros.mode_header h,
     mv_gsl_mode_retros.mode_obj_pair ld
where 1=1
  and unix_timestamp(h.fcst_valid) >= '1588483800'
  and unix_timestamp(h.fcst_valid) <= '1591023600'
  and h.model = 'HRRRv4'
  and h.fcst_var = 'REFC'
  and h.fcst_lev IN ('L0','L0=')
  and h.descr IN ('ECONUS')
  and ld.simple_flag = 1
  and h.mode_header_id = ld.mode_header_id
group by fcst_lead
order by fcst_lead;

For this example on metv-gsd.gsd.esrl.noaa.gov, the combined query execution time was 6:23 (the vast majority of this was spent on the second query), and the time to execute the METexpress statistical code on the results was 2:49.
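One likely contributor to the raw query time (offered as a sketch, not a tested fix): wrapping h.fcst_valid in unix_timestamp() inside the WHERE clause generally keeps MySQL from using an ordinary index on fcst_valid, so the date-range filter can turn into a scan. Rewriting the range predicate against the bare column, and making sure the filter/join columns are indexed, might help. The index names below are hypothetical, and the tables may already carry equivalent indexes:

-- Equivalent, index-friendly form of the time-range filter (unix_timestamp() and
-- from_unixtime() are inverses within a session, assuming the usual time zone handling):
--   and h.fcst_valid >= from_unixtime(1588483800)
--   and h.fcst_valid <= from_unixtime(1591023600)

-- Hypothetical indexes on the filter and join columns, if nothing equivalent exists:
create index idx_mode_header_model_var_valid
    on mv_gsl_mode_retros.mode_header (model, fcst_var, fcst_valid);
create index idx_mode_obj_pair_hdr_flag
    on mv_gsl_mode_retros.mode_obj_pair (mode_header_id, simple_flag);
create index idx_mode_obj_single_hdr_flag
    on mv_gsl_mode_retros.mode_obj_single (mode_header_id, simple_flag);

Running EXPLAIN on the pair query before and after a change like this would show whether the optimizer actually picks any of it up.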

@bonnystrong

bonnystrong commented Mar 30, 2022 via email

@mollybsmith-noaa
Collaborator Author

mollybsmith-noaa commented Mar 30, 2022 via email

@mollybsmith-noaa
Collaborator Author

Today's MySQL speed test returned a query time of 7:16 and a statistic calculation time of 2:36.

@mollybsmith-noaa
Collaborator Author

One interesting note: while the post-query statistical analysis on localhost takes about two and a half minutes, it seems to take 5-6 times as long on mats-docker-int (I'll try to get a specific number later). Are our containers underpowered? This seems like something we should absolutely fix if we can.

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

Okay, on MATS/METexpress, if you click "Data Lineage" from the graph page, there is a field under "basis" called "data retrieval (query) time". In METexpress, this field represents the TOTAL time for TWO things to occur: 1) for the mysql database itself to return the results of the query, and 2) for python_query_util to parse the returned data into the fields expected by the MATScommon routines and calculate the statistic. Both of these are considered "query time".

I've been trying this week to quantify the time that both 1) and 2) take in MET Objects.

Speed results for 1):
When using the query given previously with the MySQL database on metv-gsd, both localhost and the containers deployed on mats-docker-int take around 6-7 minutes to get their data back from the database if it is not cached, and around 1 minute if it is cached. On localhost I can capture this by putting print statements in python_query_util, while on mats-docker-int I have to keep a terminal window open to metv-gsd and repeatedly run "show processlist;" to see how long the query takes. Either way, the query time seems to be about the same on both.

Speed results for 2):
On localhost, once the data comes back from the database, it takes about two and a half minutes to run through all of Jeff Duda's statistical routines and exit python_query_util without any errors. On mats-docker-int, it takes 8-10 minutes in the absence of caching. This is a substantial difference and leads me to believe that the containers don't have the resources they need. The speed for this component is determined by taking the "data retrieval (query) time" from the Data Lineage and subtracting the time needed for the query itself.

All of these times are for dieoffs; the other plot types are much faster.
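For the record, the database-side monitoring described above amounts to something like the following, run from a second MySQL session against metv-gsd while the dieoff plot is being generated:

show full processlist;
-- The TIME column reports elapsed seconds for the running dieoff query. While the
-- query still appears in this list we are in phase 1 (MySQL work); once it drops off,
-- whatever remains of METexpress's "data retrieval (query) time" is phase 2
-- (the python_query_util statistics).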

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

Yes, but I want to be very specific that it is data processing in the query routines, and as such MATS/METexpress considers it to be "query time". I have not looked at the rest of the data processing, such as diff curves, setting plot options, etc. The query code is the same in both cases but it takes several times as long when run in a container on our MATS servers.

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

Sure, I can check that. Is there any way to beef up the VMs then? They are substantially slower.

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

If you want to hop on now I'll do it. mats-docker-int.

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

Okay. The query should be cached, so it'll take about a minute, and then the query routines should take longer. Plotting now.

@mollybsmith-noaa
Collaborator Author

Nope, MATS had cached that plot itself. One moment.

@mollybsmith-noaa
Collaborator Author

Right, plotting now.

@mollybsmith-noaa
Collaborator Author

Database is working on query.

@mollybsmith-noaa
Collaborator Author

Still on the MySQL component.

@mollybsmith-noaa
Collaborator Author

I'll let you know when it leaves the processlist and we move to phase 2.

@mollybsmith-noaa
Collaborator Author

Phase 2!

@mollybsmith-noaa
Collaborator Author

CPU is at 100% and memory is at 10-15%.

@randytpierce
Collaborator

randytpierce commented Mar 31, 2022 via email

@mollybsmith-noaa
Collaborator Author

Yes

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

I don't think it matters much: the difference between databases is on the order of minutes, while the localhost-vs-container discrepancy we're trying to test is on the order of hours. In any case, I'm using metv-gsd, the same as we had for mats-docker-int the other day. I believe the localhost settings currently point to model-vxtest.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

FYI, the slow thing we're trying to test (getting the statistics calculated and the data formatted properly) is included in "data retrieval (query) time". It's not considered post-query because it's performed by the query routines.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

That's the AWS machine! Where did that come from??

@mollybsmith-noaa
Collaborator Author

The fastest database of them all.

@mollybsmith-noaa
Collaborator Author

Okay, I've made two dieoffs at this point, and while they did take slightly longer than running with meteor run on localhost:3000, they were nowhere near the hours we saw on mats-docker-int.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

It was like an extra minute for each compared to localhost. Totally acceptable.

@mollybsmith-noaa
Collaborator Author

What is the exact path you are using for your settings file?

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

Yes, it definitely does. Not just mats-docker-int, though; all of the mats-docker-* machines.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

Can you do a git fetch on the mats-settings repo and see if anything changes? That is not what I pushed last week.

@mollybsmith-noaa
Collaborator Author

And indeed about rancher!

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

What branch are you on? This is bizarre.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

You'll have to screen share at the meeting and show me. I am quite confused. Is there some sort of weird symbolic link somewhere?

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@mollybsmith-noaa
Collaborator Author

Today in our meeting, Randy, Ian, and I ran the same met-object container on Docker Desktop, mats-docker-int, and model-vxtest. On both Docker Desktop and model-vxtest, which are not VMs, the met-object container returned our sample MODE dieoff plot in 7-8 minutes, on par with localhost's 6-7 minutes. This means that there is nothing wrong with the container itself, and Alpine Linux is not the culprit. However, on mats-docker-int, which is a VM, the met-object container ran for an hour and still had not returned a plot. This implies that something about the VM itself is causing our slowness.

@randytpierce
Collaborator

randytpierce commented Apr 6, 2022 via email

@randytpierce
Collaborator

randytpierce commented Oct 11, 2022 via email

@randytpierce
Collaborator

randytpierce commented Oct 11, 2022 via email

@randytpierce
Collaborator

randytpierce commented Oct 11, 2022 via email

@bonnystrong

Needs to be addressed by changing the MySQL schema.
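For context on what such a change might look like (purely an illustrative sketch; the thread above does not settle on a design, and the table and column choices here are hypothetical, drawn only from the queries earlier in this issue), one option is to pre-aggregate the per-object rows into a summary table keyed by the fields the dieoff queries group on, so that realtime queries touch far fewer rows:

-- Hypothetical pre-aggregated table; whether summaries like these can still feed the
-- per-object statistics computed in python_query_util is exactly the open question.
create table mv_gsl_mode_retros.mode_obj_pair_summary as
select h.model,
       h.fcst_var,
       h.fcst_lev,
       h.descr,
       h.fcst_lead,
       h.fcst_valid,
       count(*)              as n_pairs,
       avg(ld.interest)      as avg_interest,
       avg(ld.centroid_dist) as avg_centroid_dist
from mv_gsl_mode_retros.mode_header h
join mv_gsl_mode_retros.mode_obj_pair ld
  on h.mode_header_id = ld.mode_header_id
where ld.simple_flag = 1
group by h.model, h.fcst_var, h.fcst_lev, h.descr, h.fcst_lead, h.fcst_valid;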
