
[14.0][IMP] queue_job: add cron to purge dead jobs. #653

Open · wants to merge 3 commits into base: 14.0
9 changes: 9 additions & 0 deletions queue_job/data/queue_data.xml
@@ -10,6 +10,15 @@
<field name="state">code</field>
<field name="code">model.requeue_stuck_jobs()</field>
</record>
<record id="ir_cron_queue_job_fail_dead_jobs" model="ir.cron">
<field name="name">Take care of unresponsive jobs</field>
Contributor:

While thinking of a better name for this cron, I wonder... why don't we add another method and use only one cron?

E.g.:

<field name="code">model.gc_stuck_jobs()</field>
[...]

def gc_stuck_jobs(self, started_delta=None, force_low_delta=False):
    self.requeue_stuck_jobs()
    if started_delta:
        self.gc_dead_jobs(started_delta, force_low_delta=force_low_delta)

WDYT?

<field name="interval_number">15</field>
<field name="interval_type">minutes</field>
<field name="numbercall">-1</field>
<field name="model_id" ref="model_queue_job" />
<field name="state">code</field>
<field name="code">model.fail_dead_jobs(240)</field>
Contributor:

I think it would be nice to add advice somewhere on how to choose this value, for anyone who wants to adapt it to their configuration.
In the README or in the method description? The ideal value would be the cpu_time limit from the queue job server config, right?
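
One way to pick the value, sketched here only as an illustration: the limit_time_cpu / limit_time_real keys are Odoo's standard server options, while the helper name and the margin factor below are made up, so treat this as a starting point rather than anything the module provides.

from odoo.tools import config

def _suggested_started_delta(margin_factor=2):
    # A job that is genuinely still running cannot stay in the "started"
    # state much longer than the real-time limit, so anything well past
    # that limit is probably orphaned. Config values are in seconds,
    # the cron argument is in minutes, and 10 is the method's safety floor.
    limit_real = config.get("limit_time_real") or 120
    return max(10, int(limit_real / 60 * margin_factor))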

Contributor:

I agree. I think a comment here in the code would be nice too 😉

</record>
<!-- Queue-job-related subtypes for messaging / Chatter -->
<record id="mt_job_failed" model="mail.message.subtype">
<field name="name">Job failed</field>
55 changes: 55 additions & 0 deletions queue_job/models/queue_job.py
@@ -418,6 +418,61 @@
).requeue()
return True

def fail_dead_jobs(self, started_delta, force_low_delta=False):
Contributor:

Not a required change, but I think it would be reasonable to set force_low_delta=True here to turn your safety check on by default. I'm going to override

            <field name="code">model.fail_dead_jobs(240)</field>

anyway to significantly decrease the started_delta for my own purposes, so it will be no inconvenience to me to also disable that check.

Contributor (author):

It's the opposite; you will have to change the cron to something like:

model.fail_dead_jobs(5, force_low_delta=True)

BTW, why are you planning to put a low value here? What's your use case?

Contributor:

> BTW, why are you planning to put a low value here? What's your use case?

I'm impatient. 😉 I don't use a dedicated queue_job server, so all my jobs run within the default 60/120 CPU/real time limits. Ten minutes is an eternity. 😉 I also plan to decrease the Dead Job cron interval.

Contributor:

Suggested change:
- def fail_dead_jobs(self, started_delta, force_low_delta=False):
+ def gc_dead_jobs(self, started_delta, force_low_delta=False):

"""Set as failed job started since too long ago.

Workers can be dead without anyone noticing
Dead workers stuck the channel and provoke
famine.

This function, mark jobs started longtime ago
as failed.

Cause of death can be CPU Time limit reached
a SIGTERM, a power shortage, we can't know, etc.

This mechanism should be very exceptionnal.
It may help, for instance, if someone forget to configure
properly his system.
Comment on lines +428 to +436
Contributor:

Suggested change:
- This function, mark jobs started longtime ago
- as failed.
- Cause of death can be CPU Time limit reached
- a SIGTERM, a power shortage, we can't know, etc.
- This mechanism should be very exceptionnal.
- It may help, for instance, if someone forget to configure
- properly his system.
+ This function marks jobs started longtime ago
+ as failed.
+ Cause of death can be CPU Time limit reached
+ a SIGTERM, a power shortage, etc. We can't know.
+ This mechanism should be very exceptional.
+ It may help, for instance, if someone forgot to configure
+ properly his system.


:param started_delta: lookup time in minutes for jobs
that are in the started state

:param force_low_delta: allow a started_delta below 10 minutes
(use only if you know what you are doing)
"""
now = fields.datetime.now()
started_dl = now - timedelta(minutes=started_delta)

Contributor:

Suggested change:
- started_dl = now - timedelta(minutes=started_delta)
+ started_dl = fields.Datetime.subtract(now, minutes=started_delta)
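
For context, fields.Datetime.subtract is the ORM helper that wraps relativedelta, so for plain minute offsets on naive datetimes both forms should give the same result; the snippet below is illustrative only and assumes the standard Odoo 14 fields API.

from datetime import timedelta
from odoo import fields

now = fields.Datetime.now()
# Both expressions subtract 240 minutes from the same naive datetime;
# the helper simply keeps the arithmetic inside the ORM's own API.
assert now - timedelta(minutes=240) == fields.Datetime.subtract(now, minutes=240)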

if started_delta <= 10 and not force_low_delta:
raise exceptions.ValidationError(

_(
"started_delta is too low. Set at least 10min"
" or set argument force_low_delta=True"
)
)
domain = [

"&",
("date_started", "<=", fields.Datetime.to_string(started_dl)),
("state", "=", "started"),
]
job_model = self.env["queue.job"]
stuck_jobs = job_model.search(domain)
msg = {

"exc_info": "",
"exc_name": "Not responding worker. Is it dead ?",
"exc_message": (
"Check for odoo.service.server logs."
"Investigate logs for CPU time limit reached or check system log"
),
}
for job in stuck_jobs:
# TODO: manage retry:
# if retry < max_retry: retry=+1 and enqueue job instead
# else: set_failed
job_ = Job.load(self.env, job.uuid)
job_.set_failed(**msg)
job_.store()

def _get_stuck_jobs_domain(self, queue_dl, started_dl):
domain = []
now = fields.datetime.now()
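
The TODO in the loop above hints at re-enqueueing a dead job while it still has retries left instead of failing it outright. A rough sketch of that idea, hedged: it assumes the Job class exposes retry, max_retries and set_pending(reset_retry=...) as used elsewhere in queue_job, and the helper name is made up for illustration.

def _fail_or_retry_dead_job(self, job_record, msg):
    # Illustrative only: put the job back in the queue while retries
    # remain, otherwise mark it as failed, mirroring the TODO above.
    job_ = Job.load(self.env, job_record.uuid)
    if job_.max_retries and job_.retry < job_.max_retries:
        job_.retry += 1
        job_.set_pending(reset_retry=False)
    else:
        job_.set_failed(**msg)
    job_.store()
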
1 change: 1 addition & 0 deletions queue_job/readme/CONTRIBUTORS.rst
@@ -10,3 +10,4 @@
* Souheil Bejaoui <[email protected]>
* Eric Antones <[email protected]>
* Simone Orsi <[email protected]>
* Raphaël Reverdy <[email protected]>