bunny with TES backend does not handle scatter #382
Comments
Can you upload the workflow file you tried? Thanks
Here is the packed version. The input json would look like:
This is an issue we noticed recently that rabix has with starting jobs that have no defined input ports. I've created a branch with a quick fix and will be releasing 1.0.4 prerelease binaries soon that will contain this and some other fixes. (Also, rabix had some issues unpacking your wf; I will investigate and create a separate issue.)
That's interesting, because I am pretty sure I ran the standalone tool with bunny, but maybe I was using a jar from that branch. Thanks for the info. Let me know if the issue with the wf is my own error and I can patch it. I do know that creating a file with workdir and loading its contents does work with rabix/funnel, but it was once I added in the scatter that I ran into problems.
Ok @milos-ljubinkovic, so I tried running the unpacked workflow with some slight adjustments to not have any null inputs or baseCommands; however, I get a jackson-based exception (full logs are attached as …)
So I rebuilt bunny with updated jackson (2.9.2) and I no longer get that error; instead I get an error about the engine not being able to glob the output file (my current use case doesn't have a shared fs but does use SLURM). I have attached these logs as bugfix.noinputs.orig_jackson.logs.gz
Edit: Just noting that I added some silly meaningless inputs and a baseCommand (e.g., …)
Could you check the funnel logs after this collection fails? I'm interested in where the output files are uploaded. This might be a bug where rabix is looking for them in the wrong folder or something. There should be something like this in the funnel logs (it might depend on the log-level in your funnel config):
Might be that for some reason the url param in the "Starting upload" log part isn't in the directory that is defined as …
I haven't fully grokked the issue here, but wanted to note that we recently had a bug with the output file log in Funnel, ohsu-comp-bio/funnel#290. Not sure it applies here.
@milos-ljubinkovic could this be due to the fact that our SLURM cluster is running on nodes without a shared fs? Within each node the …
I don't know how funnel handles that situation. We just declare a shared directory (rabix.tes.storage.base) between the funnel server and the rabix executor, because rabix has to do some postprocessing on the outputs after execution; I have no idea how data transfer between slurm nodes and the funnel server works. If funnel skipped over or had an error during the upload, this is the issue, but if it succeeded in uploading outputs (anywhere, maybe a different dir than rabix.tes.storage.base) after each slurm node completed, then it's a different one.
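As a sketch, the shared-directory setting discussed above lives in bunny's configuration. The key name rabix.tes.storage.base is taken from this thread; the file name and path below are assumptions for illustration only:

```properties
# Hypothetical excerpt from bunny's config (e.g. core.properties).
# rabix.tes.storage.base should point at a directory visible to both
# the rabix executor and the funnel server; the path is an example.
rabix.tes.storage.base=/mnt/shared/rabix-storage
```

Without such a shared directory, rabix cannot perform its post-execution processing on outputs that funnel wrote elsewhere.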
If you set …
I'm having all sorts of issues with the s3 backend using the branch of bunny that I have been working on. I did go through and add the …
I think I never got the s3 urls to work in funnel when they contain access points, so rabix just defaults to aws' default s3 host. @buchanae Have you tested s3 support with other hosts, and what should the urls look like? It might be that I just used the wrong form.
@milos-ljubinkovic it seems like bunny is adding in a …
Funnel doesn't support endpoints for S3 yet, and we're not sure Funnel's S3 client will work outside of AWS. ohsu-comp-bio/funnel#338 |
@buchanae do you know if the addition of the …
@kmhernan funnel does not modify the URIs that are given to it. That …
Bunny is adding that because I hastily edited the s3 lib I use to generate the endpoint-less urls that funnel supports. When funnel gets support for endpoints I'll revert it back. In the meantime I guess I could remove the null, but funnel will still search for the file on the default amazon aws endpoint, so it wouldn't matter in this case. If you have files that are on amazon aws and declare them in task inputs without the endpoint, they should be sent to funnel without the null.
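To illustrate the two URL shapes being discussed (the bucket, host, and path names here are made up):

```text
s3://my-bucket/inputs/reads.fastq
    endpoint-less form: the host is implied to be AWS's default S3 endpoint

s3://s3.my-ceph-gateway.example/my-bucket/inputs/reads.fastq
    endpoint-qualified form, needed for non-AWS S3-compatible stores
    (ceph, cleversafe, etc.), which funnel did not yet support here
```

Which exact form funnel accepts depends on the version and storage backend in use, so treat these purely as a sketch of the distinction.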
@milos-ljubinkovic @kmhernan I am working on extending our s3 support. I am hoping to get a PR up by tomorrow that you could test. |
@kmhernan @milos-ljubinkovic -> ohsu-comp-bio/funnel#356. As I note in the PR, there are still some outstanding issues I need to address, but this should unblock further testing/development.
Great, @adamstruck. @milos-ljubinkovic, feel free to tag me when there is something to test on our ceph/cleversafe s3 stores.
@milos-ljubinkovic funnel was updated to use the new TES v0.3 spec recently. To use these new features we will need to update the TES model in bunny. Let me know if you would like help with that.
@milos-ljubinkovic I made some changes locally to test the newest funnel release and attached the diffs here. Pretty basic, but no hardcore testing.
I've updated the bunny TES models to v0.3 and I've been trying to test the generic s3 support from your pull request branch, but I can't get the urls with endpoints to work. Is the config and file url supposed to look like this or do I have the wrong format:
I can generate file urls without the endpoint and it works, but is this a valid solution?
The config isn't correct. Please look at the new docs: … Sorry about the confusing config structure. It's something we are working on.
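For reference, funnel's later documented generic S3 storage config takes roughly this shape; the endpoint and credentials below are placeholders, and the PR branch under discussion may have used somewhat different keys:

```yaml
# Hypothetical excerpt from a funnel server/worker config.
# GenericS3 is a list, so multiple non-AWS endpoints can be declared.
GenericS3:
  - Disabled: false
    Endpoint: "s3.my-ceph-gateway.example"   # placeholder host
    Key: "ACCESS_KEY"                        # placeholder credential
    Secret: "SECRET_KEY"                     # placeholder credential
```

Note that, as discussed below, this config must actually be present on the worker node (or on a shared file system) for the storage backend to be usable there.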
@adamstruck It seems like the S3 settings in the config file aren't propagated to the worker node... This is the S3 section I see on the worker node, which does not match my input config:
The config isn't propagated to the worker node unless there is a shared file system. You can put the config on the worker and modify the submit template to point to that file. I'll see what I can do to make this more straightforward when I get back next week. Also, I'll need to investigate how your worker got that config structure, since none of the storage backends in this branch use the 'fromEnv' keyword.
@adamstruck My test of funnel from your fork with my cleversafe endpoint was successful as long as I didn't have the …
Added support for s3 endpoints in the latest bunny release. The lib I used had some issues with google storage s3 due to differences in the request parameters, so I'm not sure it will work with IBM's implementation.
Scatter seems to be working for me. Whether or not all tasks are started simultaneously is determined by the pool size, which can now be configured (…).
When running with the TES backend (funnel) using a workflow that has scatter, nothing seems to actually be submitted to the TES. I have provided the DEBUG log of the run. After the last line bunny just seems to hang.
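For context, the kind of workflow under discussion — a step that scatters over an array input — can be sketched in CWL like this (the tool file and port names are hypothetical, not the actual workflow attached to this issue):

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}     # enables scatter on workflow steps
inputs:
  files: File[]                     # the array the step scatters over
outputs:
  results:
    type: File[]
    outputSource: step1/out
steps:
  step1:
    run: tool.cwl                   # hypothetical CommandLineTool
    scatter: infile                 # run tool.cwl once per element of files
    in:
      infile: files
    out: [out]
```

With a TES backend, each scattered invocation should become its own TES task; the bug reported here is that no tasks are submitted at all when scatter is present.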