Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is this a threading issue? #56

Closed
aronchick opened this issue Mar 14, 2022 · 4 comments
Closed

Is this a threading issue? #56

aronchick opened this issue Mar 14, 2022 · 4 comments

Comments

@aronchick
Copy link
Collaborator

aronchick commented Mar 14, 2022

When I execute the following command:

LOG_LEVEL=debug go test -run ^TestCommands$ github.com/filecoin-project/bacalhau/cmd/bacalhau/

The first (and only the first) test passes - though non-deterministically - e.g. could be any of the three randomly chosen.

Here's the log output:

[2207] DBG internal/dev_stack.go:116 > ==== Complete
[2207] DBG internal/dev_stack.go:123 > Finished provisioning nodes.
[2207] WRN bacalhau/devstack_test.go:213 >
========================================
Starting new job:
	name: awk
     cmd: awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/%s
    file: ../../testdata/awk_file.txt
========================================

[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [add -Q ../../testdata/awk_file.txt]
[2207] DBG bacalhau/devstack_test.go:275 > About to submit job:
cmd: awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG
[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"5b5a108a-3c17-4cef-a6be-277c42761788","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"job_created","JobSpec":{"Engine":"","Commands":["awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2\u003c0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG"],"Image":"","Cpu":0,"Memory":0,"Disk":0,"Inputs":[{"Engine":"ipfs","Cid":"QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG"}]},"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":null},"JobState":null}

[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs84900375}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs4201827199}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] INF bacalhau/submit.go:130 > Submitted Job Id: 5b5a108a-3c17-4cef-a6be-277c42761788

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs975119736}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [], pausing and trying again...

[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [], pausing and trying again...

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 25
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: true
[2207] INF internal/compute_node.go:132 > CID ([{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]) found on this server. Accepting job.
[2207] DBG internal/compute_node.go:65 > We are bidding on a job because the data is local!
&{Engine: Commands:[awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"5b5a108a-3c17-4cef-a6be-277c42761788","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid","JobSpec":null,"JobDeal":null,"JobState":{"State":"bidding","Status":"","Outputs":null}}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"5b5a108a-3c17-4cef-a6be-277c42761788","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid_accepted","JobSpec":null,"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":["QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua"]},"JobState":{"State":"running","Status":"","Outputs":null}}

[2207] DBG internal/compute_node.go:85 > BID ACCEPTED. Server (id: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua) - Job (id: 5b5a108a-3c17-4cef-a6be-277c42761788)
[2207] DBG internal/compute_node.go:145 > Running job on node: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:83 > Enforcing creation of results dir: /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG internal/compute_node.go:174 > Ensured results directory created: /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] INF internal/compute_node.go:180 > Results directory for job id (5b5a108a-3c17-4cef-a6be-277c42761788) exists: /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:54 > Running system command: sudo [docker run --privileged -d --rm --name bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 --entrypoint bash binocarlos/bacalhau-ignite-image:v1 -c tail -f /dev/null]
[2207] DBG runtime/runtime.go:146 > Script to run for job: sleep 2
awk -F',' '{x=38.7077507-$3; y=-9.1365919-$4; if(x^2+y^2<0.3^2) print}' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG
sleep 2
 R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG system/utils.go:54 > Running system command: sudo [docker cp /tmp/bacalhau-ignite-job.1711282195.sh bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562:/job.sh]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 ipfs init]
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 ipfs bootstrap rm --all]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 ipfs bootstrap add /ip4/127.0.0.1/tcp/34147/p2p/12D3KooWAc628yLorNsZC2XRV4zy5z7VGRNbE3nxo6u65F6VdKJd]
[2207] DBG runtime/runtime.go:208 > Running system ipfs daemon mount: /bin/sudo docker exec bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 ipfs daemon --mount R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG system/utils.go:28 > Running system command: sudo [docker exec bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562 psrecord bash /job.sh --log /tmp/metrics.log --plot /tmp/metrics.png --include-children]
2021-01-01 00:00:00,LISBON,38.5940,-8.9303,5.9366,21.9316
2021-01-01 16:39:00,LISBON,38.6990,-8.9610,10.9692,15.2670
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

/usr/local/lib/python3.8/dist-packages/psrecord/main.py:219: UserWarning: Attempting to set identical bottom == top == 0.0 results in singular transformations; automatically expanding.
  ax.set_ylim(0., max(log['cpu']) * 1.2)
Starting up command 'bash /job.sh' and attaching to process
Process finished (4.02 seconds)
[2207] DBG runtime/runtime.go:256 > Writing stdout to /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/stdout.log R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG runtime/runtime.go:260 > Directory found: /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG runtime/runtime.go:266 > Expected folder /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua exists?: true R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG runtime/runtime.go:273 > Writing stderr to /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/stderr.log
 R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] INF runtime/runtime.go:279 > Finished writing results of job (Id: 265f0663-6204-4da1-bc29-722bdb83c562) to results folder (/home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua). R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG runtime/runtime.go:288 > Copying files - Writing metrics.log to /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/metrics.log
 R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG runtime/runtime.go:288 > Copying files - Writing metrics.png to /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/metrics.png
 R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [add -rq /home/daaronch/.bacalhau/results/5b5a108a/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker rm -f bacalhau5b5a108a-3c17-4cef-a6be-277c42761788265f0663-6204-4da1-bc29-722bdb83c562]
[2207] DBG runtime/runtime.go:220 > Failed to run : exit status 137 R:5B5A108A-3C17-4CEF-A6BE-277C42761788
[2207] INF internal/compute_node.go:95 > Completed the job - results cid: QmReYEN87RiLa3DqFWLm5pvcW57zhxpjXh2P2zzpnqUfuF
&{Id:5b5a108a-3c17-4cef-a6be-277c42761788 Owner:QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua Spec:0xc002e60460 Deal:0xc0001f1e00 State:map[QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua:0xc0001f2980]}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"5b5a108a-3c17-4cef-a6be-277c42761788","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"results","JobSpec":null,"JobDeal":null,"JobState":{"State":"complete","Status":"Got job results cid: QmReYEN87RiLa3DqFWLm5pvcW57zhxpjXh2P2zzpnqUfuF","Outputs":[{"Engine":"ipfs","Cid":"QmReYEN87RiLa3DqFWLm5pvcW57zhxpjXh2P2zzpnqUfuF"}]}}

[2207] WRN bacalhau/devstack_test.go:213 >
========================================
Starting new job:
	name: grep
     cmd: grep kiwi /ipfs/%s
    file: ../../testdata/grep_file.txt
========================================

[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [add -Q ../../testdata/grep_file.txt]
[2207] DBG bacalhau/devstack_test.go:275 > About to submit job:
cmd: grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX
[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"d61f7f7f-4ab8-4488-ab3d-f25ffe43a159","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"job_created","JobSpec":{"Engine":"","Commands":["grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX"],"Image":"","Cpu":0,"Memory":0,"Disk":0,"Inputs":[{"Engine":"ipfs","Cid":"QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX"}]},"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":null},"JobState":null}

[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs975119736}
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs84900375}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs4201827199}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] INF bacalhau/submit.go:130 > Submitted Job Id: d61f7f7f-4ab8-4488-ab3d-f25ffe43a159

[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [], pausing and trying again...

[2207] WRN bacalhau/devstack_test.go:213 >
========================================
Starting new job:
	name: sed
     cmd: sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/%s
    file: ../../testdata/sed_file.txt
========================================

[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [add -Q ../../testdata/sed_file.txt]
[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 34
[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:214 > CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) in local refs: true
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) collecting to all refs in repo.
[2207] INF internal/compute_node.go:132 > CID ([{ipfs QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]) found on this server. Accepting job.
[2207] DBG internal/compute_node.go:65 > We are bidding on a job because the data is local!
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"d61f7f7f-4ab8-4488-ab3d-f25ffe43a159","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid","JobSpec":null,"JobDeal":null,"JobState":{"State":"bidding","Status":"","Outputs":null}}

[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"d61f7f7f-4ab8-4488-ab3d-f25ffe43a159","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid_accepted","JobSpec":null,"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":["QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua"]},"JobState":{"State":"running","Status":"","Outputs":null}}

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX}]}

[2207] DBG internal/compute_node.go:85 > BID ACCEPTED. Server (id: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua) - Job (id: d61f7f7f-4ab8-4488-ab3d-f25ffe43a159)
[2207] DBG internal/compute_node.go:145 > Running job on node: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:83 > Enforcing creation of results dir: /home/daaronch/.bacalhau/results/d61f7f7f/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG internal/compute_node.go:174 > Ensured results directory created: /home/daaronch/.bacalhau/results/d61f7f7f/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] INF internal/compute_node.go:180 > Results directory for job id (d61f7f7f-4ab8-4488-ab3d-f25ffe43a159) exists: /home/daaronch/.bacalhau/results/d61f7f7f/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:54 > Running system command: sudo [docker run --privileged -d --rm --name bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 --entrypoint bash binocarlos/bacalhau-ignite-image:v1 -c tail -f /dev/null]
[2207] DBG bacalhau/devstack_test.go:275 > About to submit job:
cmd: sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG
[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"73e5c514-62f3-42c1-8257-52039dcbe73d","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"job_created","JobSpec":{"Engine":"","Commands":["sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG"],"Image":"","Cpu":0,"Memory":0,"Disk":0,"Inputs":[{"Engine":"ipfs","Cid":"QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG"}]},"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":null},"JobState":null}

[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs84900375}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs4201827199}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG internal/compute_node.go:57 > Found new job to schedule:
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] INF bacalhau/submit.go:130 > Submitted Job Id: 73e5c514-62f3-42c1-8257-52039dcbe73d

[2207] DBG internal/compute_node.go:121 > Selecting for job with matching CID(s): [{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]

[2207] DBG ipfs/utils.go:195 > Beginning to collect all refs in IPFS Repo.
[2207] DBG ipfs/utils.go:196 > RepoPath {/tmp/bacalhau-ipfs975119736}
[2207] DBG system/utils.go:61 > Running system command: /usr/local/bin/ipfs [refs local]
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 34
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: true
[2207] INF internal/compute_node.go:132 > CID ([{ipfs QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]) found on this server. Accepting job.
[2207] DBG internal/compute_node.go:65 > We are bidding on a job because the data is local!
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"73e5c514-62f3-42c1-8257-52039dcbe73d","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid","JobSpec":null,"JobDeal":null,"JobState":{"State":"bidding","Status":"","Outputs":null}}

[2207] DBG libp2p/libp2p.go:328 > Sending event: {"JobId":"73e5c514-62f3-42c1-8257-52039dcbe73d","NodeId":"QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua","EventName":"bid_accepted","JobSpec":null,"JobDeal":{"Concurrency":1,"Confidence":1,"Tolerance":0.1,"AssignedNodes":["QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua"]},"JobState":{"State":"running","Status":"","Outputs":null}}

[2207] DBG ipfs/utils.go:202 > Finished collecting refs in IPFS Repo.
[2207] DBG ipfs/utils.go:206 > Comparing CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) collecting to all refs in repo.
[2207] DBG ipfs/utils.go:208 > Total number of local refs: 23
[2207] DBG ipfs/utils.go:214 > CID (QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG) in local refs: false
[2207] INF internal/compute_node.go:137 > No matching CIDs found on this server. Passing on job
[2207] DBG internal/compute_node.go:74 > We ignored a job because we didn't have the data:
&{Engine: Commands:[sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG] Image: Cpu:0 Memory:0 Disk:0 Inputs:[{Engine:ipfs Cid:QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG}]}

[2207] DBG internal/compute_node.go:85 > BID ACCEPTED. Server (id: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua) - Job (id: 73e5c514-62f3-42c1-8257-52039dcbe73d)
[2207] DBG internal/compute_node.go:145 > Running job on node: QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:83 > Enforcing creation of results dir: /home/daaronch/.bacalhau/results/73e5c514/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG internal/compute_node.go:174 > Ensured results directory created: /home/daaronch/.bacalhau/results/73e5c514/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] INF internal/compute_node.go:180 > Results directory for job id (73e5c514-62f3-42c1-8257-52039dcbe73d) exists: /home/daaronch/.bacalhau/results/73e5c514/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua
[2207] DBG system/utils.go:54 > Running system command: sudo [docker run --privileged -d --rm --name bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c --entrypoint bash binocarlos/bacalhau-ignite-image:v1 -c tail -f /dev/null]
[2207] DBG runtime/runtime.go:146 > Script to run for job: sleep 2
grep kiwi /ipfs/QmRyDNzrxwcL4ENNGyKLr9bf1FAs2unLKyngCb5tqaUpsX
sleep 2
 R:D61F7F7F-4AB8-4488-AB3D-F25FFE43A159
[2207] DBG system/utils.go:54 > Running system command: sudo [docker cp /tmp/bacalhau-ignite-job.2044548828.sh bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3:/job.sh]
[2207] DBG runtime/runtime.go:146 > Script to run for job: sleep 2
sed -n '/38.7[2-4]..,-9.1[3-7]../p' /ipfs/QmfNxptvVcr9R5X3KqxENWieFzRCKvncezkhf2WddMqhiG
sleep 2
 R:73E5C514-62F3-42C1-8257-52039DCBE73D
[2207] DBG system/utils.go:54 > Running system command: sudo [docker cp /tmp/bacalhau-ignite-job.578337918.sh bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c:/job.sh]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 ipfs init]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c ipfs init]
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 ipfs bootstrap rm --all]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c ipfs bootstrap rm --all]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 ipfs bootstrap add /ip4/127.0.0.1/tcp/34147/p2p/12D3KooWAc628yLorNsZC2XRV4zy5z7VGRNbE3nxo6u65F6VdKJd]
[2207] DBG system/utils.go:54 > Running system command: sudo [docker exec bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c ipfs bootstrap add /ip4/127.0.0.1/tcp/34147/p2p/12D3KooWAc628yLorNsZC2XRV4zy5z7VGRNbE3nxo6u65F6VdKJd]
[2207] DBG runtime/runtime.go:208 > Running system ipfs daemon mount: /bin/sudo docker exec bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 ipfs daemon --mount R:D61F7F7F-4AB8-4488-AB3D-F25FFE43A159
[2207] DBG runtime/runtime.go:208 > Running system ipfs daemon mount: /bin/sudo docker exec bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c ipfs daemon --mount R:73E5C514-62F3-42C1-8257-52039DCBE73D
[2207] DBG system/utils.go:43 > Error wait for results to be: Expected job to be complete, got [running], pausing and trying again...

[2207] DBG ipfs/utils.go:102 > got to after closing ipfs daemon

[2207] DBG libp2p/libp2p.go:109 > Closing bacalhau libp2p daemon

[2207] DBG libp2p/libp2p.go:109 > Closing bacalhau libp2p daemon

[2207] DBG libp2p/libp2p.go:109 > Closing bacalhau libp2p daemon

[2207] DBG ipfs/utils.go:102 > got to after closing ipfs daemon

[2207] DBG ipfs/utils.go:102 > got to after closing ipfs daemon

[2207] DBG libp2p/libp2p.go:111 > Closed bacalhau libp2p daemon

[2207] DBG libp2p/libp2p.go:111 > Closed bacalhau libp2p daemon

[2207] DBG libp2p/libp2p.go:111 > Closed bacalhau libp2p daemon

[2207] DBG system/utils.go:28 > Running system command: sudo [docker exec bacalhaud61f7f7f-4ab8-4488-ab3d-f25ffe43a159459fab34-0a26-4857-9707-0ccf27e226b3 psrecord bash /job.sh --log /tmp/metrics.log --plot /tmp/metrics.png --include-children]
[2207] DBG system/utils.go:28 > Running system command: sudo [docker exec bacalhau73e5c514-62f3-42c1-8257-52039dcbe73d22d93b06-fe9b-4293-9841-b14d0311e53c psrecord bash /job.sh --log /tmp/metrics.log --plot /tmp/metrics.png --include-children]
--- FAIL: TestCommands (30.88s)
    --- FAIL: TestCommands/grep (0.11s)
        devstack_test.go:235:
            	Error Trace:	devstack_test.go:235
            	Error:      	Received unexpected error:
            	            	open /home/daaronch/.bacalhau/results/d61f7f7f/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/stdout.log: no such file or directory
            	Test:       	TestCommands/grep
        devstack_test.go:237:
            	Error Trace:	devstack_test.go:237
            	Error:      	Should be true
            	Test:       	TestCommands/grep
        devstack_test.go:239:
            	Error Trace:	devstack_test.go:239
            	Error:      	Not equal:
            	            	expected: 1
            	            	actual  : 4
            	Test:       	TestCommands/grep
            	Messages:   	Count mismatch:
            	            	Expected: 4
            	            	Actual: 1
    --- FAIL: TestCommands/sed (6.15s)
        devstack_test.go:235:
            	Error Trace:	devstack_test.go:235
            	Error:      	Received unexpected error:
            	            	open /home/daaronch/.bacalhau/results/73e5c514/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua/stdout.log: no such file or directory
            	Test:       	TestCommands/sed
        devstack_test.go:237:
            	Error Trace:	devstack_test.go:237
            	Error:      	Should be true
            	Test:       	TestCommands/sed
        devstack_test.go:239:
            	Error Trace:	devstack_test.go:239
            	Error:      	Not equal:
            	            	expected: 1
            	            	actual  : 7
            	Test:       	TestCommands/sed
            	Messages:   	Count mismatch:
            	            	Expected: 7
            	            	Actual: 1

I SUSPECT this is something about after the job finishing, the system closes down the nodes.

For example, midway through the log, you can see:

[2207] INF internal/compute_node.go:180 > Results directory for job id (73e5c514-62f3-42c1-8257-52039dcbe73d) exists: /home/daaronch/.bacalhau/results/73e5c514/QmVn72EZAJu5xtzXeCbGTA4sTCLVcw57pHMa5p9FDBADua

Which is confirmed here - https://github.com/filecoin-project/bacalhau/blob/reformatting_logs_for_multi_threading/internal/compute_node.go#L176

However, by the time it gets to here: https://github.com/filecoin-project/bacalhau/blob/d0aef41ca20deb14741a614eced28d92e3ed0f78/internal/runtime/runtime.go#L245

The directory is gone. A bit stumped - my suspicion:

  1. The directory is gone because ipfs daemon shut down somewhere (here? https://github.com/filecoin-project/bacalhau/blob/9b2ea198cdb570804a1884c404ac9d15393928f1/internal/ipfs/utils.go#L100 or here? https://github.com/filecoin-project/bacalhau/blob/9b2ea198cdb570804a1884c404ac9d15393928f1/internal/jsonrpc_server.go#L75 or here? https://github.com/filecoin-project/bacalhau/blob/9b2ea198cdb570804a1884c404ac9d15393928f1/internal/scheduler/libp2p/libp2p.go#L108)

(I think the last one is the most likely, since it's the newest)

  1. OR the "job preparation" is not being done on the same node as the runtime, but this feels wrong. The same vm is being run.
@binocarlos
Copy link
Contributor

Ahhh this old chestnut :-)

So - me and Luke lost a day to this also, it turns out that attempting to write directly to os.Std{out,err} from a go-routine does very strange things with locking the process - we came across this when trying to close devstack after a test was running

Fixed this issue by removing TeeCommand and just printing the std{out,err} buffers - the command is now called RunCommandWithBuffer

It would be nice to know exactly why directly writing to os.Std{err,out} causes these deadlocks but let's bump that to a "deep dive to understand go concurrent stream writing" issue

@binocarlos
Copy link
Contributor

So - it turns out that there is a more fundamental problem here: golang/go#23019

In summary:

  • exec a command
  • it spawns sub-commands which spawns sub-commands (bash job.sh -> grand-children)
  • stdout / stderr pipes from grand-children are linked to main go process
  • kill the top level child process

This will cause the deadlock because the grand-child is still thinking it's connected to the stdout/stderr pipes

The solution it seems is to always write to temporary files instead of os.Std{out,err} - this does not trigger the blocking code path....

So - let's update utils.go to handle running commands in this way so we don't get bitten by this again

@lukemarsden
Copy link
Contributor

This is how the Android team worked around the problem: https://go-review.googlesource.com/c/go/+/42271/3/misc/android/go_android_exec.go

We are doing something similar.

@aronchick
Copy link
Collaborator Author

Fixed with this - a60f080

simonwo pushed a commit that referenced this issue Dec 6, 2023
* update the docs home page

* add images
simonwo pushed a commit that referenced this issue Dec 6, 2023
* typo fixes

* minor fixes

* update examples via bacalhau-project/examples@ed2a0e2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants