Long-period naicreport will not scale to fram or betzy (or maybe even saga) #620

Open
lars-t-hansen opened this issue Oct 18, 2024 · 3 comments
Labels: component:sonalyze, pri:high, task:bug, task:enhancement

Comments

@lars-t-hansen (Collaborator)

Already for Fox, the scripts that compute the 3-month load are quite expensive: they pull in a lot of data, and the computation of system load from sonar data is superlinear (#314). Though we may be able to make the algorithm something closer to linear, the underlying strategy of recomputing everything from sample data every time we need it is not sensible in the long term. The issue is really a consequence of a very flexible query model, in which we can ask for the load incurred by arbitrary users on arbitrary hosts over an arbitrary time span. This is powerful and we shouldn't abandon it, but it need not be the main engine for simpler day-to-day tasks where we compute the same result every day.

There is the possibility of caching more computed results (#506), but we could also change the strategy for how we handle load data for these reports (also mentioned in #517). It's hard to say how urgent this is. For the time being, we'd best not implement load reports for the big iron for more than the last week.

@lars-t-hansen added the task:bug, task:enhancement, and component:sonalyze labels on Oct 18, 2024
@lars-t-hansen (Collaborator, Author) commented Oct 24, 2024

The main culprit is simply this:

sonalyze load \
  -fmt=json,datetime,cpu,mem,gpu,gpumem,rcpu,rmem,rres,rgpu,rgpumem,gpus,host \
  -from 90d -daily \
  ... 

@lars-t-hansen (Collaborator, Author)

The profiling strategy is:

  • instrument the code so that each remote command creates a new profile (a sketch of one way to do this follows the list)
  • bring up an instrumented server on the normal data store, with a generous memory allowance
  • run the above query twice, once to warm up the cache and a second time to get a useful profile
  • if the useful profile shows a lot of I/O, increase the cache size and repeat
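
For the first bullet, a minimal Go sketch of per-command CPU profiling with runtime/pprof might look as follows; SONALYZE_PROFILE_DIR and withCommandProfile are illustrative names, not taken from the sonalyze sources:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime/pprof"
    "sync/atomic"
    "time"
)

// profileDir and seq are illustrative; the real instrumentation may be wired differently.
var (
    profileDir = os.Getenv("SONALYZE_PROFILE_DIR")
    seq        atomic.Int64
)

// withCommandProfile runs cmd while writing a fresh CPU profile for it, so
// that every remote command ends up in its own profile file.
func withCommandProfile(name string, cmd func() error) error {
    if profileDir == "" {
        return cmd() // profiling disabled
    }
    fn := filepath.Join(profileDir,
        fmt.Sprintf("%s-%d-%d.pprof", name, time.Now().Unix(), seq.Add(1)))
    f, err := os.Create(fn)
    if err != nil {
        return err
    }
    defer f.Close()
    if err := pprof.StartCPUProfile(f); err != nil {
        return err
    }
    defer pprof.StopCPUProfile()
    return cmd()
}

func main() {
    _ = withCommandProfile("load", func() error {
        // ... run the actual command here ...
        return nil
    })
}

Note that runtime/pprof allows only one active CPU profile per process, so concurrent commands would have to be serialized while profiling; the resulting files can then be inspected with go tool pprof.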

@lars-t-hansen (Collaborator, Author)

Well, no surprises there:

(pprof) top 10
Showing nodes accounting for 175.20s, 96.62% of 181.32s total
Dropped 190 nodes (cum <= 0.91s)
Showing top 10 nodes out of 19
      flat  flat%   sum%        cum   cum%
   174.42s 96.19% 96.19%    174.89s 96.45%  sonalyze/sonarlog.mergeStreams
     0.51s  0.28% 96.48%      3.04s  1.68%  sonalyze/sonarlog.createInputStreams
     0.27s  0.15% 96.62%      1.91s  1.05%  runtime.scanobject
         0     0% 96.62%    178.83s 98.63%  main.(*daemonCommandLineHandler).HandleCommand
         0     0% 96.62%    178.83s 98.63%  main.(*standardCommandLineHandler).HandleCommand
         0     0% 96.62%    178.83s 98.63%  main.localAnalysis
         0     0% 96.62%    178.83s 98.63%  net/http.(*ServeMux).ServeHTTP
         0     0% 96.62%    178.83s 98.63%  net/http.(*conn).serve
         0     0% 96.62%    178.83s 98.63%  net/http.HandlerFunc.ServeHTTP
         0     0% 96.62%    178.83s 98.63%  net/http.serverHandler.ServeHTTP

Assuming the current structure, the code looks about as well optimized as it can be. As expected, about 74s of the 174s is spent in the second inner loop, "Loop across streams to find smallest head", and about 91s is spent near the top of the third inner loop, "Now select values from all streams", either setting up the loop or in the first test, which filters out almost everything (most streams we look at will indeed start in the future, so it should). 74 + 91 = 165s, so that accounts for the bulk of the time; the rest falls to the body of the third loop below the initial filter.
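
For reference, a minimal, hypothetical sketch of this general merge structure (not the actual sonarlog.mergeStreams code; the Sample type and the summing of loads are placeholders):

package main

import "fmt"

// Sample is an illustrative stand-in for a sonar sample record.
type Sample struct {
    Time int64
    Load float64
}

// mergeByScanning merges time-sorted streams by repeatedly scanning every
// stream: first find the smallest head timestamp, then visit all streams
// again to select the samples at that time.  With S streams and T distinct
// timestamps this costs O(S*T) even though most streams contribute nothing
// at any given time, which is where the superlinear behavior comes from.
func mergeByScanning(streams [][]Sample) []float64 {
    const sentinel = int64(1) << 62
    heads := make([]int, len(streams))
    var out []float64
    for {
        // Loop across streams to find the smallest head.
        smallest := sentinel
        for i, s := range streams {
            if heads[i] < len(s) && s[heads[i]].Time < smallest {
                smallest = s[heads[i]].Time
            }
        }
        if smallest == sentinel {
            return out // all streams exhausted
        }
        // Now select values from all streams.  Most streams have no sample
        // at this time (their head lies in the future) and are filtered out,
        // but we still pay for inspecting every one of them.
        var sum float64
        for i, s := range streams {
            if heads[i] < len(s) && s[heads[i]].Time == smallest {
                sum += s[heads[i]].Load
                heads[i]++
            }
        }
        out = append(out, sum)
    }
}

func main() {
    a := []Sample{{Time: 0, Load: 1}, {Time: 60, Load: 2}}
    b := []Sample{{Time: 60, Load: 3}}
    fmt.Println(mergeByScanning([][]Sample{a, b})) // [1 5]
}

Keeping the stream heads in a priority queue (container/heap) would bring each step down from O(S) to O(log S), which is presumably the sort of change "something closer to linear" in the opening comment refers to.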
