Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrive-server use cases #325

Closed
adrianna-chang-shopify opened this issue Feb 9, 2021 · 9 comments
Closed

arrive-server use cases #325

adrianna-chang-shopify opened this issue Feb 9, 2021 · 9 comments

Comments

@adrianna-chang-shopify
Copy link
Contributor

adrianna-chang-shopify commented Feb 9, 2021

Moving the conversation from https://github.com/Shopify/super_maintenance/issues/8#issuecomment-775932361 over to our existing repo

From @tgwizard :

👋 I want to describe a use-case that I'm not sure fits well with the current CSV Task structure. In (app), we often run migrations and backfills on a large dataset. When we write such migrations, we often want to try them out on a subset of data. This can be all orders for a specific user, or the first 10k orders with id > X. We may often want to run them in parallel too (at least the process part), which could potentially be done by launching concurrent maintenance tasks on separate subsets of the data

Is that possible to do with a CSV Task? Seems a bit hard to do, since for CSV tasks only the process method seems like it should be implemented, whereas I imagine the parameters being applicable to the collection method.

(I imagine we'd write 2 separate maintenance tasks for the two flows above, one for single user, one for id + limit).

Is there some other way we could write these kind of migrations using the maintenance tasks gem?

@adrianna-chang-shopify
Copy link
Contributor Author

Hey @tgwizard , can you tell me a bit more about why you want to use a CSV for these use cases? You can run a Task with a subset of a large db table by specifying an ActiveRecord relation for collection:

module Maintenance
  class SubsetOrderTask < MaintenanceTasks::Task
    def collection
        Order.where(user_id: 123)
        # or, Order.where(id: 10000..20000)
    end

    def count
      collection.count
    end

    def process(order)
      # Do something
    end
  end
end

As for concurrency, it's definitely something that's on our radar, but the gem isn't able to handle that yet. You can indeed write separate tasks to handle different subsets of the data and run them concurrently. I'm realizing now that CSV Tasks might have been appealing because you can customize which collection the Task is processing? So yes, if you wanted to, you could export a CSV with ids for all the orders you wanted to process, and then upload this to a CSV-processing Task. This way you could reuse a single Task across multiple datasets BUT the drawback is that you can only run a single instance of a Task at a time, so you wouldn't be able to achieve this concurrently. (So if concurrency is important, your best bet right now would be to just write separate Tasks for different datasets).

Let me know if this answers your question(s) or if I've misunderstood anything 😄

@tgwizard
Copy link
Member

tgwizard commented Feb 9, 2021

Hmm, right, that SubsetOrderTask conceptually exactly what I want to do, but I don't want to write a specific task for each user we'd run it for.

Similarly, the extract-order-ids-to-csv and run that as a CSV task kinda works, but that doesn't guarantee that all a user's orders have been migrated (if there are new orders added after the CSV has been generated).

A CSV that contains the user IDs to run for would be a good middleground, but then the maintenance task would still need to adjust the collection implementation on top of the CSV task's version, to do the Order.where(user_id: id_from_csv) part.

@adrianna-chang-shopify
Copy link
Contributor Author

Ah, gotcha. So instead of having it be "CSV as a way to define a collection of rows to process", it would be "CSV as a way to define parameters that can be used in conjunction with another type of collection, like an ActiveRecord relation".

I think this would warrant a new type of task entirely. The CSV tasks are really intended to provide a collection for iteration, as opposed to allowing users to build new collections off of the contents of the CSV. (I realize that https://github.com/Shopify/super_maintenance/issues/8 is misleading because we talk about CSVs within the context of writing recurring tasks that take params, and is different from the usage of the CSV Task that is currently being built into the gem as it stands today).

As a workaround, to achieve:

A CSV that contains the user IDs to run for would be a good middleground, but then the maintenance task would still need to adjust the collection implementation on top of the CSV task's version, to do the Order.where(user_id: id_from_csv) part.

You could commit a CSV to your filesystem, and write a Task that does Order.where(user_id: id_from_csv) and manually read the CSV yourself. I realize that this isn't ideal, however, and I will keep note of this as a use case we want to explore further and potentially offer support for. 😄

@tgwizard
Copy link
Member

Right, ideally we wouldn't need to deploy any changes to maintenance tasks or CSV files to run the task for a different subset of data. Ideally we wouldn't need to edit a CSV and upload that to target a new subset either - a UI with some kind of parameter input would be the ideal for us.

@etiennebarrie
Copy link
Member

What if you run the task and then cancel it manually after some time? That works for some variation of subset of records.

Also we're not making this a public API just yet, but the way CsvCollection works may give you a seam to get into. It basically just defines a collection method for you, and also does what needs to be done to bring the CSV from the file input field to the collection. So maybe this could work:

module Maintenance
  class SubsetTask < MaintenanceTasks::Task
    csv_collection

    def collection
      csv = super
      SomeActiveRecord.where(user_id: csv.map {|row| row['user_id'] })
    end

    def process(record)
      #
    end
  end
end

csv_collection will give you the file input on the task page, and calling super will get you the CSV object.

Not sure how likely we are to break this API, so you may want to add a test that checks that your maintenance task works with Runner.run("Maintenance::SubsetTask", csv_file: { io: File.open('test/fixtures/files/sample.csv'), filename: 'sample.csv' }), because that is a public API, so that should always work.

@tgwizard
Copy link
Member

Oh, that actually sounds like a good option for us @etiennebarrie! Slightly tedious to create a CSV when adjusting which users to target, but that is likely still less tedious than the other ways we have of doing this.

@adrianna-chang-shopify
Copy link
Contributor Author

That's a super interesting point @etiennebarrie - obviously it's not-so-nice to hack with how CsvCollection works, but our set up with ActiveStorage gives us a nice extension point for developing a new sort of Task that accepts additional parameters via a CSV.

I wonder if we could develop an API that looks something like this:

module Maintenance
  class SubsetTask < MaintenanceTasks::Task
    with_params

    def collection
      SomeActiveRecord.where(user_id: params[:user_ids])
    end

    def process(record)
      #
    end
  end
end

And then in the UI, you would be expected to attach a CSV like this:

user_ids
1
2
3

And then Task would need to:

  • Define an attr_accessor, :params, for Tasks that have a with_params method specified. Defaults to {}
  • Allow CSVs to be uploaded for these types of Tasks
  • Parse each column of the CSV and store it in params[<header_name>] on the Task

@adrianna-chang-shopify
Copy link
Contributor Author

Although arguably for ^ if you wanted to support multiple parameters, you'd have to do something like this:

user_ids, product_ids
1, 100
2, 101
3, 102

which is stretching the design of CSVs since each row is not actually a cohesive element. Maybe a different input type would work even better, like JSON?

{
  "shop_ids": [1, 2, 3],
  "product_ids": [100, 101, 102]
}

@adrianna-chang-shopify
Copy link
Contributor Author

adrianna-chang-shopify commented Sep 8, 2021

Closing this, custom parameter support has been addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants