Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support ActiveRecord::Batches::BatchEnumerator collections #409

Merged
merged 4 commits into from
Jun 3, 2021

Conversation

adrianna-chang-shopify
Copy link
Contributor

@adrianna-chang-shopify adrianna-chang-shopify commented May 10, 2021

Closes: #390, #401

This PR allows users to specify batch collections via the following API:

# app/tasks/maintenance/update_posts_in_batches_task.rb
module Maintenance
  class UpdatePostsInBatchesTask < MaintenanceTasks::Task
    def collection
      Post.in_batches
    end

    def count
      collection.count
    end

    def process(batch_of_posts)
      batch_of_posts.update_all(content: "New content added on #{Time.now.utc}")    
    end
  end
end

Users are expected to use #in_batches. We use EnumeratorBuilder#active_record_on_batch_relations under the hood to build an enumerator that yields ActiveRecord::Relations - we get the ActiveRecord collection via @relation and the batch size via@of on the Batch Enumerator.

I've also tweaked Ticker#tick to take an increment, defaulting to 1. In TaskJob#each_iteration, I've tweaked the call to #tick to use input.size if input is_a?(ActiveRecord::Relation). This makes it so that tick count is still based on the number of records, rather than the number of batches, if we're processing batches.

@adrianna-chang-shopify
Copy link
Contributor Author

adrianna-chang-shopify commented May 10, 2021

@etiennebarrie I've requested your review on this. CI obviously fails because the Job Iteration changes haven't shipped yet, but tests pass with the changes from Shopify/job-iteration#86.

Hoping for another review on the Job Iteration PR from Job Patterns. In the meantime, I figured there was no reason to wait on getting feedback on this 😄

@etiennebarrie
Copy link
Member

use input.size if input is_a?(ActiveRecord::Relation).

I'm not so sure about this one, that will cause an additional COUNT query for each iteration, even if we don't use the increment because the last ticker update was recent enough.

Inside each_iteration, the relation is a batch relation right? So its number of elements should be equal to the batch size most of the time, and smaller for the last batch, maybe we could use that instead?

@adrianna-chang-shopify
Copy link
Contributor Author

I'm not so sure about this one, that will cause an additional COUNT query for each iteration, even if we don't use the increment because the last ticker update was recent enough.

I think we're actually okay here - when we call ActiveRecordCursor#next_batch, we do:
update_from_record(records.last) unless records.empty?. Naturally, #exists? loads the relation - so by the time we get to calling input.size in each_iteration, that relation should already be loaded, and shouldn't do another COUNT query.

I actually implemented it using just the batch size as you'd suggested initially, but then I moved away from that because the margin of error can be pretty big - if your batch size is 1000 and you have 1003 records to process, your final tick count will be 2000, which is weird.

I think we should be good with using #size instead, since this won't actually result in a SQL query.

@etiennebarrie
Copy link
Member

Hum, previously the records.empty? wouldn't have caused any additional query though because the records were loaded already. Since we're going to need at least one query to get the last record, I think we could change that to:

last_record = records.last
update_from_record(last_record) if last_record

or change update_from_record to early return if given nil.

Today it's one query to load the records (in job-iteration) + one query for updating them (in the app code).
E.g. records = relation.to_a + Post.where(id: posts.map(&:id)).update_all(published: true).

Ideally we'd move to one query to get the next cursor value + the app code.
relation.pluck(@columns) or something in job_iteration + relation.update_all(published: true).

Hopefully even if it doesn't help much with the SQL, it helps by not loading so much stuff in memory?

@adrianna-chang-shopify
Copy link
Contributor Author

cc @Shopify/rails @Shopify/core-stewardship

@adrianna-chang-shopify adrianna-chang-shopify force-pushed the batch-collections-w-batch-enumerator branch from 74f8d30 to 63ab1b1 Compare May 31, 2021 19:45
@adrianna-chang-shopify
Copy link
Contributor Author

cc @etiennebarrie

README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
app/jobs/concerns/maintenance_tasks/task_job_concern.rb Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
app/jobs/concerns/maintenance_tasks/task_job_concern.rb Outdated Show resolved Hide resolved
app/jobs/concerns/maintenance_tasks/task_job_concern.rb Outdated Show resolved Hide resolved
app/jobs/concerns/maintenance_tasks/task_job_concern.rb Outdated Show resolved Hide resolved
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the batch-collections-w-batch-enumerator branch 2 times, most recently from 4a259ae to 76f79f1 Compare June 1, 2021 21:11
Copy link
Member

@etiennebarrie etiennebarrie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more wording propositions, feel free to tweak. The rest is great 🎉 :shipit:

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the batch-collections-w-batch-enumerator branch from c826f58 to 723ee89 Compare June 3, 2021 13:01
@adrianna-chang-shopify adrianna-chang-shopify force-pushed the batch-collections-w-batch-enumerator branch from 723ee89 to d7b7dc5 Compare June 3, 2021 17:45
@adrianna-chang-shopify adrianna-chang-shopify merged commit 16b1070 into main Jun 3, 2021
@adrianna-chang-shopify adrianna-chang-shopify deleted the batch-collections-w-batch-enumerator branch June 3, 2021 17:48
@shopify-shipit shopify-shipit bot temporarily deployed to rubygems June 7, 2021 15:25 Inactive
@lamp
Copy link

lamp commented Jun 8, 2021

Thank you so much for this! 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support ActiveRecord::Batches::BatchEnumerator as a collection (or don't require a collection at all)
5 participants