Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActionMailer::MailDeliveryJob executing twice #329

Closed
tobyaherbert opened this issue Aug 12, 2021 · 9 comments
Closed

ActionMailer::MailDeliveryJob executing twice #329

tobyaherbert opened this issue Aug 12, 2021 · 9 comments
Labels
bug Something isn't working

Comments

@tobyaherbert
Copy link

tobyaherbert commented Aug 12, 2021

I’ve recently noticed that some of the ActionMailer::MailDeliveryJob jobs enqueued using the deliver_later method appear to be executed twice despite being enqueued for delivery only once.

GoodJob is configured with the default number of threads and Rails is scheduling the ActionMailer::MailDeliveryJob on the :default queue.

After a bit of digging through the log file, it looks like GoodJob could be executing the same job multiple times concurrently as the same ActiveJob UUID appears on both occurrences of the delivery job’s output which each have a different Message-ID.

I’ve uploaded a simplified copy of the log file here from which I’ve deduced the following:

  • Lines 1-10 and 13-22 have the same job UUID.
  • Lines 27 and 44 show a different Message-ID, though they are both part of multiline output (starting on lines 24 and 41 respectively) which have the same job UUID.
  • Lines 33 and 49 show different GoodJob workers having executed the same job UUID.
  • Lines 35 and 51 show GoodJob deleting the same job UUID from its table which seems strange.

As a temporary fix, I’ve moved all ActionMailer::MailDeliveryJob jobs to a :mailer queue which has just one thread to prevent this from happening.

Could this be a bug in GoodJob or is there something else I should check?

Thanks!

@bensheldon
Copy link
Owner

bensheldon commented Aug 12, 2021

@tobyaherbert thanks for reporting this and sharing the logs. This is unexpected.

Do you have the logs preceding slightly preceding what you shared?

On Line 11 and 12, there are these 2 lines, which is one GoodJob thread/worker fetching and locking the database record, and then verifying that it does have a lock. I only see this line once, and I'm curious if that is in the earlier in the logs of the other worker also fetching and verifying the lock:

11  GoodJob::Job Load (7.0ms)  SELECT "good_jobs".* FROM "good_jobs" WHERE "good_jobs"."id" IN (WITH "rows" AS MATERIALIZED ( SELECT "good_jobs"."id" FROM "good_jobs" WHERE "good_jobs"."queue_name" = 'default' AND "good_jobs"."finished_at" IS NULL AND ("good_jobs"."scheduled_at" <= '2021-08-11 14:59:32.115185' OR "good_jobs"."scheduled_at" IS NULL) ORDER BY priority DESC NULLS LAST ) SELECT "rows"."id" FROM "rows" WHERE pg_try_advisory_lock(('x' || substr(md5('good_jobs' || "rows"."id"::text), 1, 16))::bit(64)::bigint) LIMIT $1) ORDER BY priority DESC NULLS LAST  [["LIMIT", 1]]
12  GoodJob::Job Exists? (20.3ms)  SELECT 1 AS one FROM "good_jobs" LEFT JOIN pg_locks ON pg_locks.locktype = 'advisory' AND pg_locks.objsubid = 1 AND pg_locks.classid = ('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(32)::int AND pg_locks.objid = (('x' || substr(md5('good_jobs' || "good_jobs"."id"::text), 1, 16))::bit(64) << 32)::bit(32)::int WHERE "good_jobs"."finished_at" IS NULL AND ("pg_locks"."pid" = pg_backend_pid()) AND "good_jobs"."id" = $1 LIMIT $2  [["id", "33091f18-02c1-491c-8aff-a267140558f9"], ["LIMIT", 1]]

I don't have an explanation, but I can walk you through my thinking in case it spurs some insight:

  • Rails will checkout a dedicated database connection for each thread.
  • GoodJob expects each thread/worker to have its own dedicated database connection, which it uses to create connection-level advisory locks.
  • The only way I can imagine that two threads/workers would be able to execute a single GoodJob Job (e.g. a record in the good_jobs table) at the same time would be if the threads/workers were sharing a database connection, or if a database connection in one thread/worker was closed/dropped in the middle of executing the job.

Could you confirm what version of Ruby and Rails you're using? And also whether you're using PgBouncer (a database connection pooler). And are you seeing this locally too, or just in production? Strange!

@tobyaherbert
Copy link
Author

@bensheldon Thanks for the prompt reply!

Yep, I've got more logs from that period - Do you have somewhere I can send you a private link to see them?

The project is using Rails 6.1.4 with Ruby 3.0.2.
I haven't used PgBouncer and am using PostgreSQL server version 12.

Here's a snapshot of the database.yml file:

default: &default
  adapter: postgresql
  encoding: unicode
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 10 } %>
  timeout: 5000

test:
  <<: *default
  host: localhost
  port: 7433
  username: rails
  password: rails
  database: rails_test

development:
  <<: *default
  host: localhost
  port: 7432
  username: rails
  password: rails
  database: rails_dev

staging:
  <<: *default
  database: rails_staging

production:
  <<: *default
  database: rails_production

Thanks again!

@bensheldon
Copy link
Owner

Happy to help. You can email them to me: [email protected]

@bensheldon
Copy link
Owner

Thanks for sending over those logs. I don't have an explanation for why this has happened.

There isn't anything special about ActionMailer::MailDeliveryJob.

I haven't observed this happen before, and there is GoodJob test code that create a big concurrency load to surface these kinds of problems. Some suggestions:

  • Log the output of ActiveRecord::Base.connection.object_id to see if the same database ID is being shared across threads.
  • Rails has a mechanism for sharing database connections across threads for tests. You can see if this is enabled: ActiveRecord::Base.instance_variable_get(:@lock_thread) #=> should be nil/false. This code replaced a much older trick but it doesn't look like you're working in a legacy codebase.

I am stumped 🤔

@bensheldon
Copy link
Owner

@tobyaherbert I think I may have duplicated what you're seeing in #388. which is related to code reloading in the Development environment.

I double-checked the logs you sent me, and they were for Development too. I just wanted to doublecheck that this behavior wasn't seen in the Production environment (because that would imply something different than #388).

@tobyaherbert
Copy link
Author

@bensheldon I got sidetracked and completely forgot to respond to you, apologies!

Yes, we were testing our email sending feature in the development environment when we encountered the duplications.
We switched to using serial queues for most things which has mitigated the issues we were having (though I never tried it with more than one thread in the production environment).

@bensheldon
Copy link
Owner

@tobyaherbert I totally understand. I have a fix for the problem released in v2.3.1 if you want to give that a try.

@tobyaherbert
Copy link
Author

@bensheldon Ah, that's great - I'll take a peek when I get a sec 👍

@bensheldon
Copy link
Owner

@tobyaherbert I'm closing this because of age. Please comment if you're still experiencing problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants