Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kill workers that don't stop after a configurable time #13805

Merged
merged 2 commits into from
Feb 14, 2017

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Feb 7, 2017

Previously, we'd gracefully ask them to exit and if the queue work
they're doing, takes 1 hour to do, they'd exceed
memory thresholds, keep running until the work is done and finally
respond to the exit request.

Now, we mark them as 'stopping' when they
exceed a threshold and they have up to 10 minutes to finish before we'd
kill them. This value is configurable in the 'stopping_timeout' field in
each worker's advanced settings.

https://bugzilla.redhat.com/show_bug.cgi?id=1395736

@gtanzillo @carbonin Please review

@@ -99,6 +99,12 @@ def check_not_responding(class_name = nil)
processed_workers.collect(&:id)
end

NOT_RESPONDING = :not_responding
MEMORY_EXCEEDED = :memory_exceeded
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this information belongs somewhere else. Maybe MiqServer::WorkerManagement::Monitor::Reason?

Then ideally callers of worker_set_monitor_reason will also use the same constants. Not sure if making that kind of change is in this PR's scope though.

Copy link
Member

@gtanzillo gtanzillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good

@gtanzillo
Copy link
Member

Will merge after @jrafanie makes a couple of small changes.

Previously, we'd gracefully ask them to exit and if the queue work
they're doing, takes 1 hour to do, they'd exceed
memory thresholds, keep running until the work is done and finally
respond to the exit request.

Now, we mark them as 'stopping' when they
exceed a threshold and they have up to 10 minutes to finish before we'd
kill them.  This value is configurable in the 'stopping_timeout' field in
each worker's advanced settings.

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
@miq-bot
Copy link
Member

miq-bot commented Feb 10, 2017

Checked commits jrafanie/manageiq@e5f4bd3~...b60a5f0 with ruby 2.2.6, rubocop 0.47.1, and haml-lint 0.20.0
6 files checked, 1 offense detected

spec/models/miq_server/worker_management/monitor_spec.rb

@jrafanie
Copy link
Member Author

Ok, I think I got your good suggestion in. Take another look @carbonin

Copy link
Member

@carbonin carbonin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jrafanie
Copy link
Member Author

cc @jcarter12 (this is the stopping workers PR)

@gtanzillo gtanzillo added this to the Sprint 54 Ending Feb 13, 2017 milestone Feb 14, 2017
@gtanzillo gtanzillo merged commit 9764870 into ManageIQ:master Feb 14, 2017
jrafanie added a commit to jrafanie/manageiq that referenced this pull request Feb 14, 2017
Due to module inclusion spaghetti, it's easier and less confusing
to reference the Reason constants consistently in the MiqServer class,
which is the ultimate destination for all of these modules.

Fixes ManageIQ#13901 introduced in ManageIQ#13805

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
@jrafanie jrafanie deleted the stopping_worker branch February 14, 2017 17:35
jrafanie pushed a commit to jrafanie/manageiq that referenced this pull request Feb 16, 2017
Kill workers that don't stop after a configurable time
(cherry picked from commit 9764870)

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
jrafanie pushed a commit to jrafanie/manageiq that referenced this pull request Feb 16, 2017
Kill workers that don't stop after a configurable time
(cherry picked from commit 9764870)

https://bugzilla.redhat.com/show_bug.cgi?id=1395736
@jrafanie jrafanie added bug and removed enhancement labels Feb 16, 2017
@jrafanie
Copy link
Member Author

Euwe backport: #13949
Darga backport: #13950

@simaishi
Copy link
Contributor

simaishi commented Mar 3, 2017

Backported to Euwe via #13949

@simaishi simaishi removed the euwe/yes label Mar 3, 2017
@simaishi
Copy link
Contributor

Backported to Darga via #13950

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants