Aggregating the output artifacts of parallel steps (fan-in) #934
Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

What happened:
Separating this issue from #861 to handle aggregation of output artifacts. Similar to output parameters, which can already be aggregated across loop iterations, we need some mechanism to aggregate artifacts from parallel steps. With parameters, the solution was to introduce a new variable, `steps.XXXX.outputs.parameters`, available as a JSON list. For artifacts we need something similar; the trick is how we would place the aggregated artifacts into a subsequent pod.

Comments
When aggregating and importing the artifacts generated by steps in a loop, I think the most useful option may be to allow subsequent steps to reference a composite artifact that merges the contents of all the output artifacts into one. Each instance of the loop could use the iteration index, or another parameter passed in, to name its output file/directory so that the merge does not overwrite content. This gives the user control over how to name and structure the layout of the composite artifact. What do you think?
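To make the idea concrete, here is a hypothetical sketch of how a subsequent step might reference such a composite artifact. None of this syntax exists today; the loop-wide `from` expansion is invented purely for discussion:

```yaml
# Hypothetical syntax, not implemented in Argo. Each loop iteration is
# assumed to have written its files under a unique name (e.g. using the
# iteration index), so merging the outputs does not overwrite anything.
- name: merge
  inputs:
    artifacts:
    - name: all-results
      path: /tmp/all-results
      # Invented: expands to the union of every iteration's "result" artifact.
      from: "{{steps.process-items.outputs.artifacts.result}}"
```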
What if the user does not have control over the location of the output artifact? For example, what if the container being run is some off-the-shelf Docker image that writes its output to a fixed location?
Even with an off-the-shelf image, I think the user could override the output location. If desired, we could support a second form for accessing the composite output artifact. One drawback with this approach is that it only works well with loops; I think the first approach is more general.
This seems to be a fairly stale issue, but it would be extremely handy for easily parallelizable tasks such as scraping multiple sources and generating large datasets that need to be merged at the end and analysed as a whole. To be more specific, my use case is to scrape several sources in parallel and then merge all of their outputs for analysis in a final step. This last step is exactly the use case for the feature described in this issue; everything else I was able to do.
Since this bug was filed, we now support disabling archiving on artifacts, which makes it possible for multiple steps to write their outputs to a common S3 directory.
@jessesuen can you point me to that pattern where multiple steps can output to a common S3 directory?
Need to write a proper example, but the idea is that you would disable .tgz archiving on an output file, as sketched below. The subsequent step would then recursively download the parent S3 key as a "directory." The enhancement made in v2.2 is that if the S3 location appears to be a "directory" instead of a file, the executor performs a recursive download of all of the contents of that "directory." Directory is in quotes because S3 is really just a key/value store.
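A minimal sketch of both halves of the pattern; the bucket, endpoint, and secret names are placeholders:

```yaml
# Fan-out: each parallel step uploads its raw (un-archived) file under a
# common prefix, using a unique name per pod so nothing collides.
outputs:
  artifacts:
  - name: result
    path: /tmp/result.json
    archive:
      none: {}                      # disable .tgz archiving
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-bucket
      key: "{{workflow.name}}/results/{{pod.name}}.json"
      accessKeySecret: {name: my-s3-creds, key: accessKey}
      secretKeySecret: {name: my-s3-creds, key: secretKey}

# Fan-in: pointing an input artifact at the parent key triggers the
# recursive "directory" download added in v2.2.
inputs:
  artifacts:
  - name: results
    path: /tmp/results
    s3:
      endpoint: s3.amazonaws.com
      bucket: my-bucket
      key: "{{workflow.name}}/results"
      accessKeySecret: {name: my-s3-creds, key: accessKey}
      secretKeySecret: {name: my-s3-creds, key: secretKey}
```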
So, does your container job need to be aware of its "index", or is the template using it to save the artifacts with unique names (e.g. using the item index)?
@jessesuen, did you ever get a chance to add an example which uses this pattern?
@jessesuen this is extremely interesting for running experiments as well, FYI: https://www.ovh.com/blog/simplify-your-research-experiments-with-kubernetes/
This seems potentially broken for GCS per #1351
@jessesuen: Is there a particular concern in trying to support tasks.X.outputs.artifacts when using loops, and simply extracting all matches? It's up to the workflow designer to ensure the overlay is correct. @edlee2121 @fj-sanchez I'm suggesting we skip any complexity around indexes and just use unique names. For example, something like the loop below is working just fine to get the artifacts onto storage:
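A sketch reconstructing the shape of such a loop; the item values and template names are illustrative, and the S3 connection fields (endpoint, bucket, credentials) from the earlier example are assumed:

```yaml
templates:
- name: fan-out
  steps:
  - - name: process
      template: process-item
      arguments:
        parameters:
        - {name: item, value: "{{item}}"}
      withItems: [alpha, beta, gamma]
- name: process-item
  inputs:
    parameters:
    - name: item
  container:
    image: alpine:3
    command: [sh, -c]
    args: ["mkdir -p /out && echo '{{inputs.parameters.item}}' > /out/{{inputs.parameters.item}}.txt"]
  outputs:
    artifacts:
    - name: out
      path: /out
      s3:
        # {{pod.name}} is unique per iteration, so uploads never collide.
        key: "{{workflow.name}}/{{pod.name}}"
```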
All I'd need to pull out of storage is to loop through the list, much as parameter aggregation does, and unpack the archives into the target folder. Unfortunately, due to the GCS folders bug (#1351), I can't use the folder-clone technique to just download everything under the {{workflow.name}} key. I did go ahead and write a template that uses the cloud SDK and gsutil cp to do that work, as sketched below; it's also mentioned in my comment on that report. Parameter aggregation is a little funky: {{pod.name}} is evaluated prior to the loop and so is not the same pod that actually writes out the artifact. This PR may address that: #1336
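That workaround template might look roughly like this (image, bucket, and auth setup are assumptions; gsutil needs credentials configured, e.g. via a mounted service-account key):

```yaml
- name: gcs-fan-in
  inputs:
    parameters:
    - name: prefix
  container:
    image: google/cloud-sdk:slim
    command: [sh, -c]
    # -m parallelizes the copy; -r recursively fetches everything under the prefix.
    args: ["mkdir -p /tmp/parts && gsutil -m cp -r 'gs://my-bucket/{{inputs.parameters.prefix}}/*' /tmp/parts/"]
  outputs:
    artifacts:
    - name: parts
      path: /tmp/parts
```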
Does anybody have a working example of how to do this? If not, I'll work something out myself. Edit 1: One thing I tried was using a shared volume. Edit 2: I got something to work! Here is my attempt, along the lines of the sketch below:
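A minimal self-contained sketch of this shared-volume approach, assuming a storage class that supports ReadWriteMany (all names, sizes, and images are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: fan-in-volume-
spec:
  entrypoint: main
  volumeClaimTemplates:
  - metadata:
      name: workdir
    spec:
      accessModes: ["ReadWriteMany"]  # required if steps may land on different nodes
      resources:
        requests:
          storage: 1Gi
  templates:
  - name: main
    steps:
    - - name: produce                 # fan-out: three parallel writers
        template: produce
        arguments:
          parameters:
          - {name: index, value: "{{item}}"}
        withItems: [0, 1, 2]
    - - name: merge                   # fan-in: a single reader sees all outputs
        template: merge
  - name: produce
    inputs:
      parameters:
      - name: index
    container:
      image: alpine:3
      command: [sh, -c]
      args: ["echo part-{{inputs.parameters.index}} > /work/out-{{inputs.parameters.index}}.txt"]
      volumeMounts:
      - {name: workdir, mountPath: /work}
  - name: merge
    container:
      image: alpine:3
      command: [sh, -c]
      args: ["cat /work/out-*.txt > /work/merged.txt && cat /work/merged.txt"]
      volumeMounts:
      - {name: workdir, mountPath: /work}
```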
I have used that approach initially as well; it works as long as all steps are running on the same node or the volume supports ReadWriteMany access.
@sebinsua This example rocks, thanks for sharing! I was able to get this running with an NFS-backed volume. @TekTimmy if you are still blocked on this, I provided a link to the Helm chart below. https://github.com/helm/charts/tree/master/stable/nfs-server-provisioner
Document the pattern. #2549
I'm having a little trouble figuring out the state of this. Looks like there's a solution that works on a volume mount but not on an artifact store? EDIT: I see now, you can use the "hard-wired" S3 approach, specifying endpoint, bucket, etc. along with a directory key. It doesn't quite work for me because transferring the whole directory to the container is too much; I need to be able to use a parameter as part of the key name, or something like a withArtifact that would work similarly to withParam. EDIT 2: I just searched for withArtifact and found this: #2758
Hmm. Should not have been closed.
I've created an example of a map-reduce job in Argo Workflows that aggregates outputs. Please take a look.
I like this form, but I think there also needs to be a way to access artifacts per-output.
Interesting example.
The map-reduce example gave me an idea of how Argo could solve artifact aggregation with minimal effort. Perhaps we could implement artifact aggregation the same way as for parameters: the loop node should collect per-output artifact lists, and the aggregating step would consume them. We could make it possible to consume a list of artifacts directly (just for illustration; most users won't use this directly, only the aggregation built on top of it). The artifact lists can be produced by loop nodes and passed to the aggregators, as in the hypothetical sketch below:
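For illustration, a hypothetical shape for this, mirroring how parameter aggregation exposes a JSON list (the list-valued artifact reference below does not exist in Argo):

```yaml
# Hypothetical syntax - not implemented. The loop node "process" is imagined
# to expose its per-iteration "part" artifacts as a list, which the
# aggregator consumes by unpacking each element into its own subdirectory.
- name: aggregate
  inputs:
    artifacts:
    - name: parts
      path: /tmp/parts/
      from: "{{steps.process.outputs.artifacts.part}}"  # imagined list-valued reference
```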
It's simply that map-reduce patterns need some enthusiastic support. I found the easiest way to explore the space was to actually mock out the YAML flows as objects; in one experiment, I just used JavaScript with a DAG library. Looking forward to seeing what the community comes up with (my experiment was nuked by the client). I'm just saying: yes, map-reduce needs to happen, but you can experiment with it just by thinking about an in-memory DAG.
This can be achieved for bucket-based artifacts (S3/GCP/OSS) using key-only artifacts. See #4618
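A minimal sketch of the key-only pattern, assuming a default artifact repository is configured for the cluster (prefixes and paths are illustrative):

```yaml
# Fan-out: each iteration writes under a shared prefix. No bucket, endpoint,
# or credentials are given - they come from the configured artifact repository.
outputs:
  artifacts:
  - name: part
    path: /tmp/part.json
    archive:
      none: {}
    s3:
      key: "{{workflow.name}}/parts/{{pod.name}}.json"

# Fan-in: an input artifact whose key is the shared prefix downloads all parts.
inputs:
  artifacts:
  - name: parts
    path: /tmp/parts
    s3:
      key: "{{workflow.name}}/parts"
```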