
LSF bjobs for all users hangs #745

Open
vallerul opened this issue Feb 11, 2022 · 3 comments
vallerul commented Feb 11, 2022

The Active Jobs app, when showing all users, hangs when LSF is running thousands of jobs and the active job history in LSF is kept for days instead of hours. `CLEAN_PERIOD` in the LSF configuration controls how much finished-job data `bjobs` retrieves. `CLEAN_PERIOD` is usually a day, but when it was increased to 3 days, the app hung indefinitely.
The issue comes from the `bjobs` arguments in `lib/ood_core/job/adapters/lsf/batch.rb`:

```ruby
def get_jobs_for_user(user)
  args = %W( -u #{user} -a -w -W )
  parse_bjobs_output(call("bjobs", *args))
end
```

`bjobs -u all -a -w -W` is very resource intensive when thousands of jobs are scheduled and can take close to forever to return.

I had to make the following change (remove `-a`) to make it respond:

```ruby
def get_jobs_for_user(user)
  args = %W( -u #{user} -w -W )
  parse_bjobs_output(call("bjobs", *args))
end
```

It would be good to make this configurable instead of changing it in code.
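One way the configurable behavior could look is sketched below. This is only an illustration, not ood_core's actual API: the `include_finished` option name and the `BjobsArgs` class are assumptions introduced here to show how the `-a` flag could be gated behind a setting that defaults to off.

```ruby
# Hypothetical sketch: build the bjobs argument list from a config flag
# instead of hard-coding -a. The :include_finished option name is an
# assumption, not part of ood_core's real adapter configuration.
class BjobsArgs
  def initialize(include_finished: false)
    @include_finished = include_finished
  end

  # Returns the argument list for `bjobs -u <user>`. The -a flag
  # (include recently finished jobs, expensive on busy clusters) is
  # only added when explicitly enabled.
  def for_user(user)
    args = ["-u", user.to_s]
    args << "-a" if @include_finished
    args + ["-w", "-W"]
  end
end
```

With the default, `BjobsArgs.new.for_user("all")` yields `["-u", "all", "-w", "-W"]`, matching the workaround above; sites that want finished jobs would opt in explicitly.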


@johrstrom
Contributor

Thanks. Looks like it's here:

`args = %W( -u #{user} -a -w -W )`

and here:

`args = %W( -a -w -W #{id.to_s} )`

Does `bjobs` respond to `CLEAN_PERIOD` as an environment variable?

@vallerul
Author

As far as I remember, it cannot be used as an environment variable.
`CLEAN_PERIOD` is set in the `lsb.params` configuration file and is usually part of the site's scheduler policies.

https://www.ibm.com/support/pages/how-increase-default-retention-job-information-lsf-memory.
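For reference, the setting lives in `lsb.params`. The fragment below is only an illustrative sketch of where it goes (the value shown is LSF's documented default of 3600 seconds, i.e. one hour); a reconfigure such as `badmin reconfig` is needed for mbatchd to pick up the change.

```
# lsb.params -- illustrative fragment, not a complete file
Begin Parameters
CLEAN_PERIOD = 3600    # seconds that finished jobs remain visible to bjobs -a
End Parameters
```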

@johrstrom
Contributor

Hmm, OK. Yeah, it seems like we could default to false (not passing the `-a` flag) and folks can enable it if they choose.

I can't recall what Torque did, but Slurm doesn't keep job info around for very long.
