bunny with TES backend does not handle scatter #382

kmhernan · 2017-11-09T17:35:27Z

When running with the TES backend (funnel) using a workflow that has scatter nothing seems to actually be submitted to the TES. I have provided the DEBUG log of the run. After the last line bunny just seems to hang.

[2017-11-09 17:28:03.804] [DEBUG] DAGCache initialized with size=16
[2017-11-09 17:28:03.816] [DEBUG] ApplicationCache initialized with size=16
[2017-11-09 17:28:04.268] [INFO] Backend f5ed244a-da69-4067-bcb7-859d7e70e613 registered.
[2017-11-09 17:28:04.291] [DEBUG] Start Job Job [id=null, parentId=null, rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, name=null, status=PENDING, message=null, config=null, inputs={terms=[hello01, hello02, hello03, hello04]}, outputs=null]
[2017-11-09 17:28:04.449] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.457] [INFO] Job null rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 started
[2017-11-09 17:28:04.459] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.524] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.524] [DEBUG] Cache size=1
[2017-11-09 17:28:04.527] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.528] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.529] [DEBUG] Cache size=1
[2017-11-09 17:28:04.529] [DEBUG] JobRecord root. Decrementing port terms.
[2017-11-09 17:28:04.529] [DEBUG]
Job root input counters:
 -- Input port terms, counter=0

[2017-11-09 17:28:04.529] [DEBUG]
Job root output counters:

[2017-11-09 17:28:04.529] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.530] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.530] [DEBUG] Cache size=1
[2017-11-09 17:28:04.530] [DEBUG]  --- JobRecord root is ready. Job isBlocking=false
 --- Input port terms, isScatter=false, isBlocking false
 --- All scatter ports []

[2017-11-09 17:28:04.531] [DEBUG]
 -- JobRecord root.run_load_contents, isBlocking false
 -- Input port term, isScatter true
 -- Output port json_file, isScatter false
 -- Output port out_value, isScatter false

[2017-11-09 17:28:04.534] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.534] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.534] [DEBUG] Cache size=1
[2017-11-09 17:28:04.534] [DEBUG] JobRecord root.run_load_contents. Decrementing port term.
[2017-11-09 17:28:04.534] [DEBUG]
Job root.run_load_contents input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.534] [DEBUG]
Job root.run_load_contents output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.535] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.535] [DEBUG] Cache size=1
[2017-11-09 17:28:04.537] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.537] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.537] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.537] [DEBUG] Cache size=1
[2017-11-09 17:28:04.537] [DEBUG] JobRecord root.run_load_contents.1. Decrementing port term.
[2017-11-09 17:28:04.538] [DEBUG]
Job root.run_load_contents.1 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.538] [DEBUG]
Job root.run_load_contents.1 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.538] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.538] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.538] [DEBUG] Cache size=1
[2017-11-09 17:28:04.538] [DEBUG]  --- JobRecord root.run_load_contents.1 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.538] [DEBUG]
 -- JobRecord root.run_load_contents.1.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.539] [DEBUG]
 -- JobRecord root.run_load_contents.1.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.540] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.540] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.541] [DEBUG] Cache size=1
[2017-11-09 17:28:04.541] [DEBUG] JobRecord root.run_load_contents.1.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.541] [DEBUG]
Job root.run_load_contents.1.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.541] [DEBUG]
Job root.run_load_contents.1.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.542] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.542] [DEBUG] Cache size=1
[2017-11-09 17:28:04.568] [DEBUG]
 ---- JobRecord root.run_load_contents.1

[2017-11-09 17:28:04.568] [INFO] Container job root.run_load_contents.1 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.568] [DEBUG] onJobRootReady(jobId=18b4980c-72e8-40aa-9c3a-7e41bfd73a06)
[2017-11-09 17:28:04.569] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.569] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.569] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.569] [DEBUG] Cache size=1
[2017-11-09 17:28:04.569] [DEBUG] JobRecord root.run_load_contents.2. Decrementing port term.
[2017-11-09 17:28:04.569] [DEBUG]
Job root.run_load_contents.2 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.569] [DEBUG]
Job root.run_load_contents.2 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.569] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.570] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.570] [DEBUG] Cache size=1
[2017-11-09 17:28:04.570] [DEBUG]  --- JobRecord root.run_load_contents.2 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.570] [DEBUG]
 -- JobRecord root.run_load_contents.2.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.571] [DEBUG]
 -- JobRecord root.run_load_contents.2.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.572] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.572] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.572] [DEBUG] Cache size=1
[2017-11-09 17:28:04.572] [DEBUG] JobRecord root.run_load_contents.2.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.572] [DEBUG]
Job root.run_load_contents.2.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.572] [DEBUG]
Job root.run_load_contents.2.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.572] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.573] [DEBUG] Cache size=1
[2017-11-09 17:28:04.574] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

[2017-11-09 17:28:04.574] [INFO] Container job root.run_load_contents.2 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.575] [DEBUG] onJobRootReady(jobId=e297562f-5384-44f4-b47a-045a84e70283)
[2017-11-09 17:28:04.575] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.575] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.576] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.576] [DEBUG] Cache size=1
[2017-11-09 17:28:04.576] [DEBUG] JobRecord root.run_load_contents.3. Decrementing port term.
[2017-11-09 17:28:04.576] [DEBUG]
Job root.run_load_contents.3 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.576] [DEBUG]
Job root.run_load_contents.3 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.576] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.576] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.576] [DEBUG] Cache size=1
[2017-11-09 17:28:04.576] [DEBUG]  --- JobRecord root.run_load_contents.3 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.577] [DEBUG]
 -- JobRecord root.run_load_contents.3.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.577] [DEBUG]
 -- JobRecord root.run_load_contents.3.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.578] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.579] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.579] [DEBUG] Cache size=1
[2017-11-09 17:28:04.579] [DEBUG] JobRecord root.run_load_contents.3.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.579] [DEBUG]
Job root.run_load_contents.3.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.579] [DEBUG]
Job root.run_load_contents.3.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.579] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.579] [DEBUG] Cache size=1
[2017-11-09 17:28:04.582] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

[2017-11-09 17:28:04.583] [INFO] Container job root.run_load_contents.3 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.583] [DEBUG] onJobRootReady(jobId=d93de3c5-5568-4f2a-9423-0ebdebdd5486)
[2017-11-09 17:28:04.584] [DEBUG] Reset output port counters for root.run_load_contents to 4
[2017-11-09 17:28:04.584] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.584] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.584] [DEBUG] Cache size=1
[2017-11-09 17:28:04.585] [DEBUG] JobRecord root.run_load_contents.4. Decrementing port term.
[2017-11-09 17:28:04.585] [DEBUG]
Job root.run_load_contents.4 input counters:
 -- Input port term, counter=0

[2017-11-09 17:28:04.585] [DEBUG]
Job root.run_load_contents.4 output counters:
 -- Output port json_file, counter=1
 -- Output port out_value, counter=1

[2017-11-09 17:28:04.585] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.586] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.586] [DEBUG] Cache size=1
[2017-11-09 17:28:04.586] [DEBUG]  --- JobRecord root.run_load_contents.4 is ready. Job isBlocking=false
 --- Input port term, isScatter=true, isBlocking false
 --- All scatter ports [term]

[2017-11-09 17:28:04.586] [DEBUG]
 -- JobRecord root.run_load_contents.4.run_get_uuid, isBlocking false
 -- Output port uuid, isScatter false

[2017-11-09 17:28:04.587] [DEBUG]
 -- JobRecord root.run_load_contents.4.run_load_contents, isBlocking false
 -- Input port uuid, isScatter false
 -- Input port my_string, isScatter false
 -- Output port test_json, isScatter false
 -- Output port test_value, isScatter false

[2017-11-09 17:28:04.589] [DEBUG] Root Job 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 goes to EventProcessor 1
[2017-11-09 17:28:04.589] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents.run_load_contents found in cache
[2017-11-09 17:28:04.589] [DEBUG] Cache size=1
[2017-11-09 17:28:04.590] [DEBUG] JobRecord root.run_load_contents.4.run_load_contents. Decrementing port my_string.
[2017-11-09 17:28:04.590] [DEBUG]
Job root.run_load_contents.4.run_load_contents input counters:
 -- Input port uuid, counter=1
 -- Input port my_string, counter=0

[2017-11-09 17:28:04.590] [DEBUG]
Job root.run_load_contents.4.run_load_contents output counters:
 -- Output port test_json, counter=1
 -- Output port test_value, counter=1

[2017-11-09 17:28:04.591] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root.run_load_contents found in cache
[2017-11-09 17:28:04.591] [DEBUG] Cache size=1
[2017-11-09 17:28:04.592] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

 ---- JobRecord root.run_load_contents.4

[2017-11-09 17:28:04.593] [INFO] Container job root.run_load_contents.4 rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.593] [DEBUG] onJobRootReady(jobId=bea60dc2-be00-456d-9fb5-8dd00f7b95f9)
[2017-11-09 17:28:04.593] [DEBUG] DAGNode rootId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173, id=root found in cache
[2017-11-09 17:28:04.593] [DEBUG] Cache size=1
[2017-11-09 17:28:04.613] [DEBUG]
 ---- JobRecord root.run_load_contents.1

 ---- JobRecord root.run_load_contents.2

 ---- JobRecord root.run_load_contents.3

 ---- JobRecord root.run_load_contents.4

 ---- JobRecord root

[2017-11-09 17:28:04.614] [INFO] Container job root rootId: 267139cd-b0d4-497f-b4cb-ea3a6fd3c173 id ready.
[2017-11-09 17:28:04.614] [DEBUG] onJobRootReady(jobId=267139cd-b0d4-497f-b4cb-ea3a6fd3c173)
[2017-11-09 17:28:04.614] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryJobRecordRepository@6052b831 flushed 14 item(s). Cache hits 53.
[2017-11-09 17:28:04.615] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryVariableRecordRepository@45de06b0 flushed 38 item(s). Cache hits 25.
[2017-11-09 17:28:04.615] [DEBUG] org.rabix.engine.store.memory.impl.InMemoryLinkRecordRepository@75c73d5e flushed 31 item(s). Cache hits 5.

The text was updated successfully, but these errors were encountered:

milos-ljubinkovic · 2017-11-09T19:59:36Z

Can you upload the workflow file you tried? Thanks

kmhernan · 2017-11-10T00:36:51Z

Here is the packed version. The input json would look like:

{
    "terms": ["hello01", "hello02", "hello03", "hello04"]
}

scatter_test.packed.cwl.gz

milos-ljubinkovic · 2017-11-10T15:09:05Z

This is an issue we noticed recently that rabix has with starting jobs that have no defined input ports. I've created a branch with a quick fix and will be releasing a 1.0.4 prerelease binaries soon that will contain this and some other fixes.

(Also, rabix had some issues unpacking your wf, I will investigate and create a separate issue)

kmhernan · 2017-11-10T15:35:52Z

that's interesting cause I am pretty sure I ran the standalone tool with bunny, but maybe I was using a jar from that branch. Thanks for the info. Let me know if the issue with the wf is my own error and I can patch it. I do know the creating a file with workdir and loading its contents does work with rabix/funnel, but once I added in the scatter is when I ran into problems.

kmhernan · 2017-11-10T19:49:35Z

Ok @milos-ljubinkovic so I tried running the unpacked workflow with some slight adjustments to not have any null inputs or baseCommands; however I get a jackson-based exception (full logs are attached as bugfix.noinputs.orig_jackson.logs.gz.

java.util.concurrent.ExecutionException: java.lang.IllegalStateException: com.fasterxml.jackson.databin
 at [Source: N/A; line: -1, column: -1]

So I rebuilt bunny with updated jackson (2.9.2) and I no longer get that error, but instead it is an error about the engine not being able to glob the output file (my current use case doesn't have a shared fs but does use SLURM). I have attached these logs as bugfix.noinputs.updated_jackson.logs.gz. Now what is interesting is that I can run the same workflow that doesn't use scatter and it will work. To be helpful, I have attached a tar.gz containing the workflows. The scatter_test.cwl is the scattered one and an example input json is scatter.inputs.json. The contents_workflow.cwl is the workflow that the scatter_test.cwl workflow scatters and it completes successfully using the basic.inputs.json file.

bugfix.noinputs.orig_jackson.logs.gz
bugfix.noinputs.updated_jackson.logs.gz
test_cwl.tar.gz

Edit: Just noting that I added some silly meaningless inputs and baseCommand (e.g., ls) to see if that was an issue since I had them as null more or less

milos-ljubinkovic · 2017-11-13T12:49:26Z

error about the engine not being able to glob the output file

Could you check the funnel logs after this collection fails? I'm interested where are the output files uploaded. This might be a bug where rabix is looking for them in the wrong folder or something.

There should be something like this in funnel logs (might depend on the log-level in your funnel config):

local                Starting upload
attempt              0
index                0
taskID               b84p40blt51hgl189heg
time                 2017-11-13T13:37:28+01:00
timestamp            2017-11-13T12:37:28.042344Z
url                  file:///tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid

....

local                TASK_OUTPUTS
attempt              0
index                0
outputs              []*tes.OutputFileLog{
    &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/job.err.log", SizeBytes:0},
    &tes.OutputFileLog{Url:"/tmp/tes/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", Path:"/Users/milosljubinkovic/Downloads/funnel-work-dir/b84p40blt51hgl189heg/Users/milosljubinkovic/Downloads/test_cwl/4c20af28-99b1-4ca2-8c89-94d0fd25c4d2/root/run_load_contents/3/run_get_uuid/uuid.txt", SizeBytes:36},
}

Might be that for some reason the url param in the "Starting upload" log part isn't in the directory that is defined as rabix.tes.storage.base in rabix config or that "uuid.txt" in the "TASK_OUTPUTS" log hasn't been created.

buchanae · 2017-11-13T15:37:09Z

I haven't fully grokked the issue here, but wanted to note that we recently had a bug with the output file log in Funnel, ohsu-comp-bio/funnel#290 . Not sure it applies here.

kmhernan · 2017-11-14T18:24:54Z

@milos-ljubinkovic could this be due to the fact that our SLURM cluster in running on nodes without a shared fs? Within each node the rabix.tes.storage.base is there, but across nodes it is obviously not the same due to it not being on a shared fs?

milos-ljubinkovic · 2017-11-14T18:49:33Z

I don't know how funnel handles that situation. We just declare a shared directory (rabix.tes.storage.base) between funnel server and rabix executor because rabix has to do some postprocessing on the outputs after execution, no idea how data transfer between slurm nodes and funnel server works.

If funnel skipped over or had an error during the upload this is the issue, but if it succeeded in uploading outputs (anywhere, maybe a different dir. that rabix.tes.storage.base) after each slurm node completed then it's is a different one.

buchanae · 2017-11-14T19:56:31Z

If you set rabix.tes.storage.base to a bucket storage location, e.g. s3://rabix-bucket/shared, is rabix able to use that?

kmhernan · 2017-11-14T22:23:39Z

I'm having all sorts of issues with s3 backend using the branch of bunny that I have been working on. I did go through and add the META-INF directory to get rid of the provider errors, but it still isn't able to communicate. It is important to note that this is a cleversafe s3 store and not aws.

milos-ljubinkovic · 2017-11-14T23:47:18Z

I think I never got the s3 urls to work in funnel when they contain access points so rabix just defaults to aws' default s3 host.

@buchanae Have you tested s3 support with other hosts and what should the urls look like? Might be I just used the wrong form.

kmhernan · 2017-11-15T00:14:39Z

@milos-ljubinkovic it seems like bunny is adding in a null to the uri for the s3

local                System error
attempt              0
error                Output upload not supported by storage: name:"run_get_uuid" url:"s3://null/bioinformatics_scratch/kmhernan/bunny/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" path:"/home/ubuntu/funnel-work-dir/b85ocllv48vjfgd4hsjg/home/ubuntu/kmhernan/custom_rabix/28fbf43e-6d2e-4da0-858f-44e5ec4f3888/root/run_load_contents/3/run_get_uuid" type:DIRECTORY
index                0
taskID               b85ocllv48vjfgd4hsjg
time                 2017-11-15T00:12:07Z
timestamp            2017-11-15T00:12:07.019349748Z

buchanae · 2017-11-15T02:05:46Z

Funnel doesn't support endpoints for S3 yet, and we're not sure Funnel's S3 client will work outside of AWS. ohsu-comp-bio/funnel#338

kmhernan · 2017-11-20T18:50:33Z

@buchanae do you know if the adding of the null to the URI is from the funnel side ? e.g., s3://null/

adamstruck · 2017-11-20T18:57:31Z

@kmhernan funnel does not modify the URIs that are given to it. That null must be coming from bunny.

milos-ljubinkovic · 2017-11-20T18:58:24Z

Bunny is adding that because I've hastily edited the s3 lib I use to generate urls without endpoints that funnel supports. When funnel gets support for endpoints I'll revert it back.

In the meantime I guess I could remove the null but funnel will still search for the file on the default amazon aws endpoint so it wouldn't matter in this case.

If you have files that are on amazon aws and declare them in task inputs without the endpoint it should be sent to funnel without null.

adamstruck · 2017-11-20T19:01:05Z

@milos-ljubinkovic @kmhernan I am working on extending our s3 support. I am hoping to get a PR up by tomorrow that you could test.

adamstruck · 2017-11-21T01:32:20Z

@kmhernan @milos-ljubinkovic -> ohsu-comp-bio/funnel#356

As I note in the PR there are still some outstanding issues I need to address. But this should unblock further testing / development.

kmhernan · 2017-11-21T01:36:55Z

great @adamstruck .. I just want to let you know @milos-ljubinkovic feel free to tag me in something to test on our ceph/cleversafe s3 stores when you get a chance

adamstruck · 2017-11-22T00:39:09Z

@milos-ljubinkovic funnel was updated to use the new TES v0.3 spec recently. To use these new features we will need to update the TES model in bunny. Let me know if you would like help with that.

kmhernan · 2017-11-22T13:09:42Z

@milos-ljubinkovic i made some changes locally to test the newest funnel release and attached the diffs here. Pretty basic, but no hardcore testing.
tes_update_diffs.txt

milos-ljubinkovic · 2017-11-23T11:44:31Z

I've updated the bunny TES models to v0.3 and I've been trying to test the generic s3 support from your pull request branch, but I can't get the urls with endpoints to work.

Is the config and file url supposed to look like this or do I have the wrong format:

Worker:
  Storage:
    S3:
      AWS:
        Endpoint: s3.amazonaws.com
        Key: key
        Secret: secret

"url": "s3://s3.amazonaws.com/testbunny/input.txt"

I can generate file urls without the endpoint and it works, but is this a valid solution?

adamstruck · 2017-11-23T16:04:12Z

The config isn't correct. Please look at the new docs:

https://github.com/adamstruck/funnel/blob/3370b1df2271886e56a9a883535aea196592f408/website/content/docs/storage/aws-s3.md

Sorry about the confusing config structure. It's something we are working on.

kmhernan · 2017-11-23T17:02:26Z

@adamstruck It seems like the S3 settings in the config file aren't propagated to the worker node...

This is the S3 section I see on the worker node which does not match my input config

    S3:
      FromEnv: false
      Key: ""
      Secret: ""

adamstruck · 2017-11-23T18:24:37Z

The config isn't propagated to the worker node unless there is a shared file system. You can put the config on the worker and modify the submit template to point to that file.

I'll see what I can do to make this more straight forward when I get back next week.

Also I'll need to investigate how your worker got that config structure since non of the storage backends in this branch use the 'fromEnv' keyword.

kmhernan · 2017-11-28T13:42:21Z

@adamstruck my test of funnel from your fork with my cleversafe endpoint was successful as long as I didn't have the http:// prefix to the endpoint. @milos-ljubinkovic for the url of the file I did not include the endpoint, e.g., my url was "s3://bioinformatics_scratch/kmhernan/meta.txt" I hope this helps

milos-ljubinkovic · 2018-01-10T17:40:26Z

Added support for s3 endpoints in the latest bunny release. The lib I used had some issues with google storage s3 due to differences in the request parameters, so not sure it will work with IBM's implementation.

adamstruck · 2018-02-20T22:26:16Z

Scatter seems to be working for me. Whether or not all tasks are started simultaneously is determined by the pool size which can now be configured (rabix.tes.task-thread-pool-size).

buchanae mentioned this issue Nov 28, 2017

Support for non-Amazon S3 services ohsu-comp-bio/funnel#356

Merged

johnbradley mentioned this issue Dec 6, 2017

Compare performance of CWL implementations Duke-GCB/lando#103

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bunny with TES backend does not handle scatter #382

bunny with TES backend does not handle scatter #382

kmhernan commented Nov 9, 2017

milos-ljubinkovic commented Nov 9, 2017

kmhernan commented Nov 10, 2017

milos-ljubinkovic commented Nov 10, 2017

kmhernan commented Nov 10, 2017

kmhernan commented Nov 10, 2017 •

edited

Loading

milos-ljubinkovic commented Nov 13, 2017

buchanae commented Nov 13, 2017

kmhernan commented Nov 14, 2017

milos-ljubinkovic commented Nov 14, 2017

buchanae commented Nov 14, 2017

kmhernan commented Nov 14, 2017

milos-ljubinkovic commented Nov 14, 2017

kmhernan commented Nov 15, 2017

buchanae commented Nov 15, 2017

kmhernan commented Nov 20, 2017

adamstruck commented Nov 20, 2017

milos-ljubinkovic commented Nov 20, 2017

adamstruck commented Nov 20, 2017

adamstruck commented Nov 21, 2017

kmhernan commented Nov 21, 2017

adamstruck commented Nov 22, 2017

kmhernan commented Nov 22, 2017

milos-ljubinkovic commented Nov 23, 2017 •

edited

Loading

adamstruck commented Nov 23, 2017

kmhernan commented Nov 23, 2017

adamstruck commented Nov 23, 2017

kmhernan commented Nov 28, 2017

milos-ljubinkovic commented Jan 10, 2018

adamstruck commented Feb 20, 2018

bunny with TES backend does not handle scatter #382

bunny with TES backend does not handle scatter #382

Comments

kmhernan commented Nov 9, 2017

milos-ljubinkovic commented Nov 9, 2017

kmhernan commented Nov 10, 2017

milos-ljubinkovic commented Nov 10, 2017

kmhernan commented Nov 10, 2017

kmhernan commented Nov 10, 2017 • edited Loading

milos-ljubinkovic commented Nov 13, 2017

buchanae commented Nov 13, 2017

kmhernan commented Nov 14, 2017

milos-ljubinkovic commented Nov 14, 2017

buchanae commented Nov 14, 2017

kmhernan commented Nov 14, 2017

milos-ljubinkovic commented Nov 14, 2017

kmhernan commented Nov 15, 2017

buchanae commented Nov 15, 2017

kmhernan commented Nov 20, 2017

adamstruck commented Nov 20, 2017

milos-ljubinkovic commented Nov 20, 2017

adamstruck commented Nov 20, 2017

adamstruck commented Nov 21, 2017

kmhernan commented Nov 21, 2017

adamstruck commented Nov 22, 2017

kmhernan commented Nov 22, 2017

milos-ljubinkovic commented Nov 23, 2017 • edited Loading

adamstruck commented Nov 23, 2017

kmhernan commented Nov 23, 2017

adamstruck commented Nov 23, 2017

kmhernan commented Nov 28, 2017

milos-ljubinkovic commented Jan 10, 2018

adamstruck commented Feb 20, 2018

kmhernan commented Nov 10, 2017 •

edited

Loading

milos-ljubinkovic commented Nov 23, 2017 •

edited

Loading