[14.0][IMP] queue_job: add cron to purge dead jobs. #653
base: 14.0
@@ -10,6 +10,15 @@
        <field name="state">code</field>
        <field name="code">model.requeue_stuck_jobs()</field>
    </record>
    <record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
        <field name="name">Take care of unresponsive jobs</field>
        <field name="interval_number">15</field>
        <field name="interval_type">minutes</field>
        <field name="numbercall">-1</field>
        <field name="model_id" ref="model_queue_job" />
        <field name="state">code</field>
        <field name="code">model.fail_dead_jobs(240)</field>
Review comment: I think it would be nice to add some advice somewhere on how to choose this value, in case someone wants to adapt it to their configuration.

Reply: I agree. I think a comment here in the code would be nice too 😉
    </record>
    <!-- Queue-job-related subtypes for messaging / Chatter -->
    <record id="mt_job_failed" model="mail.message.subtype">
        <field name="name">Job failed</field>
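On the reviewers' question of how to choose the value passed to `fail_dead_jobs`, one possible rule of thumb is sketched below. It is an assumption, not something stated in this PR: derive the delta from the longest time a job may legitimately run (for example the worker's real-time limit) and add a margin. The helper name `suggest_started_delta` and its `safety_factor` and `floor_minutes` parameters are hypothetical. The floor of 11 minutes only reflects the guard in the diff, which rejects values of 10 or less unless `force_low_delta=True`.

```python
import math


def suggest_started_delta(limit_time_real_s, safety_factor=2.0, floor_minutes=11):
    """Return a candidate started_delta in minutes (hypothetical helper).

    limit_time_real_s: longest wall-clock time, in seconds, that a job is
        allowed to run on the workers that execute queue jobs.
    safety_factor: margin applied on top of that limit (an assumption).
    floor_minutes: lowest value returned, chosen to stay above the
        10-minute guard in fail_dead_jobs.
    """
    if limit_time_real_s <= 0:
        raise ValueError("limit_time_real_s must be positive")
    minutes = math.ceil(limit_time_real_s * safety_factor / 60)
    return max(minutes, floor_minutes)
```

With a two-hour limit and the default margin this returns 240; with a 120-second limit it returns the floor of 11. Whether a given margin is adequate depends on the deployment, so the result is best treated as a starting point.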
@@ -418,6 +418,61 @@
        ).requeue()
        return True

    def fail_dead_jobs(self, started_delta, force_low_delta=False):
Review comment: Not a required change, but I think it would be reasonable to <field name="code">model.fail_dead_jobs(240)</field> anyway to significantly decrease the started_delta for my own purposes, so it will be no inconvenience to me to also disable that check.

Reply: It's the opposite: you would have to change the cron to something like model.fail_dead_jobs(5, force_low_delta=True). By the way, why are you planning to use a low value here? What's your use case?

Reply: I'm impatient. 😉 I don't use a dedicated queue_job server, so all my jobs run within the default 60/120 cpu/real time limits. Ten minutes is an eternity. 😉 I also plan to decrease the Dead Job cron interval.

Suggested change
        """Set as failed job started since too long ago.

        Workers can be dead without anyone noticing
        Dead workers stuck the channel and provoke
        famine.

        This function, mark jobs started longtime ago
        as failed.

        Cause of death can be CPU Time limit reached
        a SIGTERM, a power shortage, we can't know, etc.

        This mechanism should be very exceptionnal.
        It may help, for instance, if someone forget to configure
        properly his system.
Comment on lines +428 to +436
Suggested change

        :param started_delta: lookup time in minutes for jobs
            that are in started state,

        :param force_low_delta: force a started_delta in less 10min
            = you know what you do
        """
        now = fields.datetime.now()
        started_dl = now - timedelta(minutes=started_delta)
Suggested change
        if started_delta <= 10 and not force_low_delta:
            raise exceptions.ValidationError(
                _(
                    "started_delta is too low. Set at least 10min"
                    " or set argument force_low_delta=True"
                )
            )
        domain = [
            "&",
            ("date_started", "<=", fields.Datetime.to_string(started_dl)),
            ("state", "=", "started"),
        ]
        job_model = self.env["queue.job"]
        stuck_jobs = job_model.search(domain)
        msg = {
            "exc_info": "",
            "exc_name": "Not responding worker. Is it dead ?",
            "exc_message": (
                "Check for odoo.service.server logs."
                "Investigate logs for CPU time limit reached or check system log"
            ),
        }
        for job in stuck_jobs:
            # TODO: manage retry:
            # if retry < max_retry: retry=+1 and enqueue job instead
            # else: set_failed
            job_ = Job.load(self.env, job.uuid)
            job_.set_failed(**msg)
            job_.store()

    def _get_stuck_jobs_domain(self, queue_dl, started_dl):
        domain = []
        now = fields.datetime.now()
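To make the selection rule in `fail_dead_jobs` easier to follow outside of Odoo, here is a minimal standalone sketch. It is not the PR's code: jobs are plain dictionaries instead of `queue.job` records, the function name `select_dead_jobs` and the `now` parameter are hypothetical, and a `ValueError` stands in for Odoo's `ValidationError`. It only mirrors the two steps visible in the diff: the guard on low deltas and the cutoff on `date_started` for jobs in the `started` state.

```python
from datetime import datetime, timedelta

# Mirrors the literal 10 used by the guard in the diff.
MIN_DELTA_MINUTES = 10


def select_dead_jobs(jobs, started_delta, force_low_delta=False, now=None):
    """Return the jobs that the cutoff rule would mark as failed.

    jobs: iterable of dicts with "state" and "date_started" keys
        (a stand-in for queue.job records, assumed for this sketch).
    started_delta: age in minutes after which a started job is
        considered dead.
    """
    # Same condition as the diff: 10 itself is rejected unless forced.
    if started_delta <= MIN_DELTA_MINUTES and not force_low_delta:
        raise ValueError(
            "started_delta is too low; pass force_low_delta=True to override"
        )
    now = now or datetime.now()
    cutoff = now - timedelta(minutes=started_delta)
    return [
        job
        for job in jobs
        if job["state"] == "started" and job["date_started"] <= cutoff
    ]
```

Called with `started_delta=240`, only jobs that have been in the `started` state for four hours or more are returned. Calling it with `5` raises unless `force_low_delta=True` is passed, which matches the reply given in the review thread above.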
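The TODO left in the loop of `fail_dead_jobs` describes a retry policy that the PR does not implement. A rough sketch of what that comment describes could look like the following. Everything here is illustrative: `JobStub`, its field names, and `fail_or_retry` are hypothetical, and the use of a `pending` state for a re-queued job is an assumption about how the real job model would be driven.

```python
from dataclasses import dataclass


@dataclass
class JobStub:
    """Hypothetical stand-in for a job; not the queue_job Job class."""

    retry: int = 0
    max_retries: int = 3
    state: str = "started"


def fail_or_retry(job):
    """Apply the policy from the TODO to a dead job.

    If the job still has retries left, count one more attempt and put it
    back in the queue; otherwise mark it as failed.
    """
    if job.retry < job.max_retries:
        job.retry += 1
        job.state = "pending"  # assumed state for a re-queued job
    else:
        job.state = "failed"
    return job.state
```

A real implementation would also need to decide whether a job killed with its worker should consume a retry at all, which the TODO leaves open.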
@@ -10,3 +10,4 @@
* Souheil Bejaoui <[email protected]>
* Eric Antones <[email protected]>
* Simone Orsi <[email protected]>
* Raphaël Reverdy <[email protected]>
While thinking of a better name for this cron I wonder.... why don't we add another method and use only one cron?
Eg:
WDYT?