[14.0][IMP] queue_job: add cron to purge dead jobs. #653
base: 14.0
Conversation
Hi @guewen,
queue_job/data/queue_data.xml
@@ -10,6 +10,15 @@
    <field name="state">code</field>
    <field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
    <field name="name">Jobs Garbage Collector</field>
This has the same name as the job above.
queue_job/models/queue_job.py
"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
if started_delta <= 10:
Instead of the hardcoded 10 minutes here, perhaps use config['limit_time_real'] seconds?
You may have a short limit_time_real for the regular HTTP workers and a long one for queue workers on a separate instance.
The idea is to use this cron as a last line of defence against poorly configured instances.
I still think this would be better served with something like
if started_delta <= int(self.env['ir.config_parameter'].sudo().get_param('queue_job.limit_time_dead', 10)):
I added an argument to bypass the 10 min limit without having to change the code.
@@ -418,6 +418,61 @@ def requeue_stuck_jobs(self, enqueued_delta=5, started_delta=0):
        ).requeue()
        return True

    def fail_dead_jobs(self, started_delta, force_low_delta=False):
Not a required change, but I think it would be reasonable to default to force_low_delta=True here to turn your safety check off by default. I'm going to override
<field name="code">model.fail_dead_jobs(240)</field>
anyway to significantly decrease the started_delta for my own purposes, so it will be no inconvenience to me to also disable that check.
It's the opposite, you will have to change the cron to something like:
model.fail_dead_jobs(5, force_low_delta=True)
BTW, why are you planning to put a low value here? What's your use case?
BTW, why are you planning to put a low value here? What's your use case?
I'm impatient. 😉 I don't use a dedicated queue_job server, so all my jobs run within the default 60/120 cpu/real time limits. Ten minutes is an eternity. 😉 I also plan to decrease the Dead Job cron interval.
LGTM
I believe in a new version (18?) it would be nice to merge both crons.
The cron would then requeue the enqueued jobs and set to failed the jobs started for too long (and make this behavior the default).
But it is better not to do it in an already released version.
<field name="numbercall">-1</field>
<field name="model_id" ref="model_queue_job" />
<field name="state">code</field>
<field name="code">model.fail_dead_jobs(240)</field>
I think it would be nice to add advice somewhere on how to choose the value, if someone wants to adapt it to their config.
In the Readme or in the method description? The ideal value would be the cpu_time limit from the queue job server config, right?
I agree. I think a comment here in the code would be nice too 😉
This function, mark jobs started longtime ago
as failed.

Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.

This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
Suggested change:
- This function, mark jobs started longtime ago
- as failed.
- Cause of death can be CPU Time limit reached
- a SIGTERM, a power shortage, we can't know, etc.
- This mechanism should be very exceptionnal.
- It may help, for instance, if someone forget to configure
- properly his system.
+ This function marks jobs started longtime ago
+ as failed.
+ Cause of death can be CPU Time limit reached
+ a SIGTERM, a power shortage, etc. We can't know.
+ This mechanism should be very exceptional.
+ It may help, for instance, if someone forgot to configure
+ properly his system.
= you know what you do
"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)
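The safety check and deadline computation being discussed can be sketched in plain, framework-free Python (a minimal sketch: the function name dead_job_deadline and the floor_minutes parameter are illustrative, not part of the PR's API):

```python
from datetime import datetime, timedelta


def dead_job_deadline(now, started_delta, force_low_delta=False, floor_minutes=10):
    """Return the cutoff datetime: jobs started before it are considered dead.

    Refuses a suspiciously low started_delta unless force_low_delta is set,
    mirroring the safety check debated in this review thread.
    """
    if started_delta <= floor_minutes and not force_low_delta:
        raise ValueError(
            "started_delta <= %d minutes; pass force_low_delta=True "
            "only if you know what you are doing" % floor_minutes
        )
    return now - timedelta(minutes=started_delta)
```

With the cron's default of 240 minutes, a job started more than four hours ago falls before the deadline; a low value like 5 is rejected unless explicitly forced.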
Suggested change:
- started_dl = now - timedelta(minutes=started_delta)
+ started_dl = fields.Datetime.subtract(now, minutes=started_delta)
def fail_dead_jobs(self, started_delta, force_low_delta=False):
Suggested change:
- def fail_dead_jobs(self, started_delta, force_low_delta=False):
+ def gc_dead_jobs(self, started_delta, force_low_delta=False):
@@ -10,6 +10,15 @@
    <field name="state">code</field>
    <field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
    <field name="name">Take care of unresponsive jobs</field>
While thinking of a better name for this cron I wonder... why don't we add another method and use only one cron?
E.g.:
<field name="code">model.gc_stuck_jobs()</field>
[...]
def gc_stuck_jobs(self, started_delta=None, force_low_delta=False):
    self.requeue_stuck_jobs()
    if started_delta:
        self.gc_dead_jobs(started_delta, force_low_delta=force_low_delta)
WDYT?
I'm trying to improve the remediation of stuck started jobs.
The Jobs Garbage Collector cron, with its second parameter, lets you requeue started jobs. But if the root cause, like a CPU limit error, is still present after the requeue, the issue always reappears.
With this new cron, the job is marked as failed and not requeued.
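The requeue-versus-fail split described above can be illustrated with a framework-free sketch (the helper collect_stuck_jobs and the dict-based job records are invented for illustration; the state names follow queue_job's conventions):

```python
from datetime import datetime, timedelta

# Simplified job states, named after queue_job's own.
ENQUEUED, STARTED, FAILED = "enqueued", "started", "failed"


def collect_stuck_jobs(jobs, now, enqueued_delta=5, started_delta=240):
    """Split stuck jobs into those to requeue and those to mark as failed.

    - enqueued too long ago: the worker never picked them up -> requeue
    - started too long ago: the worker likely died -> fail, and do NOT
      requeue, so a persistent root cause (e.g. a CPU limit error)
      does not make the job loop forever
    """
    to_requeue, to_fail = [], []
    for job in jobs:
        if job["state"] == ENQUEUED and job["enqueued_at"] <= now - timedelta(minutes=enqueued_delta):
            to_requeue.append(job)
        elif job["state"] == STARTED and job["started_at"] <= now - timedelta(minutes=started_delta):
            to_fail.append(job)
    return to_requeue, to_fail
```

A recently started job is left alone; only jobs past their respective deadlines are touched, which is the behaviour the two crons implement separately in the PR.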