
Bacalhau run stuck on 'Finding node(s) for the job' #2033

Closed

jsolly opened this issue Feb 23, 2023 · 17 comments
@jsolly

jsolly commented Feb 23, 2023

Context

When attempting to run a Docker image on Bacalhau, it gets stuck on the 'Finding node(s) for the job' step. Is this a temporary outage, or is there a different issue?

Steps to Reproduce

bacalhau docker run -v bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y:/project/inputs jsolly/segmentation_testbed

(screenshot of the run output)
It's been going for over 20 mins now. I will let it go overnight and see what happens 🤷‍♂️

Thoughts

I haven't tried other images.
I am on Arm64 architecture, but I built the image using:

docker buildx build --platform linux/amd64 -t segmentation_testbed .

and I confirmed the architecture on Docker Hub.

Environment

Client Version: v0.3.22
Server Version: v0.3.22
Link to repository -> Segmentation Testbed

@wdbaruni
Member

There seems to be an issue with our latest CLI where it defaults to our development endpoint instead of production. Setting export BACALHAU_ENVIRONMENT=production before running the job should work around it for now. I am working on a better fix.
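
For example (using the original command from this issue):

export BACALHAU_ENVIRONMENT=production
bacalhau docker run -v bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y:/project/inputs jsolly/segmentation_testbed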

@wdbaruni
Member

To clarify, the job is now failing because it times out even after increasing --timeout to 1h, which is the current maximum value allowed on the network. Trying ipfs get bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y on my machine seems to make slow progress and then gets stuck at 87.68%. That could be the reason it is currently failing. Where is the CID hosted?
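
For reference, the resubmission with the increased timeout looked roughly like this (assuming --timeout is given in seconds, so 3600 for one hour):

bacalhau docker run --timeout 3600 -v bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y:/project/inputs jsolly/segmentation_testbed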

@jsolly
Author

jsolly commented Feb 28, 2023

I followed @wesfloyd's advice and added the file to web3.storage. I notice it still says 'pinning', so I think it's not fully on IPFS yet. I will retry once the status moves to 'complete'.
(screenshot)

cc @TaylorOshan

@wesfloyd
Contributor

wesfloyd commented Mar 1, 2023

@jsolly checking back to see if Web3.storage is working alright for you now? We can help with the pinning effort or find another pinning service if you're having issues

@jsolly
Author

jsolly commented Mar 2, 2023

Web3.storage is still pinning the resource. I will give it until Monday and if it isn't pinned, I might need to find another pinning service.

@dchoi27

dchoi27 commented Mar 7, 2023

Hey folks - I'm from the web3.storage team.

Re:

I do notice it still says pinning so I think it's not fully on IPFS yet. Will re-try once the status moves to complete.

So that's not exactly true. As long as your upload completed, the content is available on IPFS. Have you tested reading the data off the network (either via an IPFS node or an HTTP gateway)?
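
For example, either of these is a quick way to check (using the CID from the original command in this issue; w3s.link is our HTTP gateway):

ipfs get bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y
curl -I https://bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y.ipfs.w3s.link/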

Sorry about the confusion - we only report a "pinning status" because that's what people are used to with Kubo nodes. We artificially mimic what a Kubo node does by traversing the entire graph of uploaded data to make sure it's a complete graph. It's a super inefficient process to do while the data itself is being uploaded, and adds a lot of latency, so we do it asynchronously. However, for large graphs of data, the job that traverses the graph can time out, and the dashboard reports this "pinning" state indefinitely. But if you're able to access your content via IPFS or HTTP gateways, your content was fully uploaded.

@dchoi27

dchoi27 commented Mar 7, 2023

FWIW - we're moving away from reporting this status at all once we move to our new upload API (currently in beta - you're welcome to try it out!) https://blog.web3.storage/posts/w3up-beta-launch

@jsolly
Author

jsolly commented Mar 10, 2023

Thanks for the message @dchoi27!

I do think that's a pretty bad user experience, especially for people not from a Kubo background (like me). You mentioned:

However, for large graphs of data, the job that traverses the graph can time out, and the dashboard reports this "pinning" state indefinitely.

But I am still seeing this indefinite pinning status with files that are ~15 MB. Is that considered a 'large graph of data'?
(screenshot)

I have files that are ~400 MB that made it to the 'complete' status, so perhaps something else is going on?

I'll definitely check out this Upload API! Thanks for sharing.

@jsolly
Author

jsolly commented Mar 10, 2023

@jsolly checking back to see if Web3.storage is working alright for you now? We can help with the pinning effort or find another pinning service if you're having issues

I think I need to find another pinning service. I am not able to download the file through the browser link:
https://bafybeihu2bvhibtb4nv6mhxdkeu5ubwu5wz74n5iyfzgao4gbrzkivxr7y.ipfs.w3s.link/

And when trying ipfs get bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y I ran into an error:

Error: Your programs version (12) is lower than your repos (13).
Please update ipfs to a version that supports the existing repo, or run
a migration in reverse.
See https://github.com/ipfs/fs-repo-migrations/blob/master/run.md for details.
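
The error suggests either upgrading Kubo or reverse-migrating the repo; the upgrade path would look roughly like this (assuming a Homebrew install on macOS, after which the daemon can migrate the repo as needed):

brew upgrade ipfs
ipfs daemon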

I tried the migration and ran into a different error.

@dchoi27

dchoi27 commented Mar 10, 2023

The graph size is correlated with, but is not exactly, the size of the data (you could have a lot of tiny blocks). In any case, this isn't an issue with the new API at all, and we no longer report "pinning status" in the new API (since it has minimal benefit to report as a hosted service provider, and is really unscalable to report).

But I was mistaken that that was the issue - for the upload corresponding to bafybeihu2bvhibtb4nv6mhxdkeu5ubwu5wz74n5iyfzgao4gbrzkivxr7y, it looks like that upload just failed (we only have ~2MB associated with that CID). If you just re-upload, it should work fine.

For

And when trying ipfs get bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y I ran into an error

I think this is a Kubo/Bacalhau thing - unfortunately I don't think we or any other IPFS provider / pinning service can help. (maybe @wesfloyd has an idea)

In any case, if you do try out other hosted IPFS providers, would be curious about your experience there - we've designed web3.storage specifically to avoid the scale problems you see with other pinning providers (have a blog post on this here https://blog.web3.storage/posts/web3-storage-architecture). Of course, there still can be annoying UX kinks as we try to move the community forward to a more scalable/performant place (talk less about pinning, more about CAR files), but think we're on the right track.

@jsolly
Author

jsolly commented Mar 11, 2023

Fair enough @dchoi27! I am down to use the new uploader.

In the meantime, I did re-upload the file to web3.storage and that CID now shows as 'complete'.
(screenshot)

I am able to download the file at:
https://bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y.ipfs.w3s.link/

My IPFS CLI is still broken, but I am pretty sure the file is on IPFS now since I can access it at the address above in a browser.

I tried running the Bacalhau job again and it failed with a timeout error:
bacalhau describe cb7f7f12-84d3-4a6f-98d0-5370ec8a30bc

@wesfloyd can you try bacalhau docker run -v bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y:/project/inputs jsolly/segmentation_testbed and let me know if you can see anything else going on?

I am suspecting there is still a download issue from IPFS even though it seems like the upload via web3.storage was successful.

@dchoi27 I see there are a lot of different ways to use this new beta uploader. I would like to use a CLI. I think this is the one I should be using, https://github.com/web3-storage/w3cli. Is that right?

cc @TaylorOshan

@dchoi27

dchoi27 commented Mar 13, 2023

Yes the CLI is the recommended one! Though keep your eyes peeled for a new version coming out either late this week or early next week - there will be some major UX upgrades (for instance, during the beta period new uploads won't show up in the web3.storage web app, but we'll have a beta web app specifically for w3up once the upgrade comes out)
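
If it helps, the basic flow with the CLI looks roughly like this (command names per the w3cli README; the email, space name, and file path are just placeholders):

npm install -g @web3-storage/w3cli
w3 login you@example.com
w3 space create my-space
w3 up ./my-file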

@wesfloyd
Contributor

wesfloyd commented Mar 13, 2023

I can confirm the file can be downloaded via IPFS
(screenshot)

Testing the Bacalhau command now, it appears to continue running after 10 mins, which is surprising.
(screenshot)

And the job state sits in "Bid Accepted"
(screenshot)

@jsolly for our reference - do you have any guidance you could share on how long this job took to run on your local machine?

I'll bring this up with our engineers now via the #bacalhau Slack channel

@wesfloyd
Contributor

After re-testing today, I was able to get the job to run after 50s, but the output displayed an error:
python3: can't open file '/project/inputs/segmentation_testbed_2.py': [Errno 2] No such file or directory

So I updated the Dockerfile to manually copy the segmentation script and sent a PR.

The job is still running without ending; troubleshooting now.

@jsolly
Author

jsolly commented Mar 15, 2023

@wesfloyd it takes about 1 minute to run on my M1 MacBook. When running in Docker, I did need to bump the RAM allocated to the container to 12 GB instead of 8 GB, or else the job would fail.
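
For reference, the local run is roughly equivalent to this (the 12 GB memory limit is the important part; the mount path is just illustrative):

docker run --platform linux/amd64 --memory=12g -v $(pwd)/inputs:/project/inputs jsolly/segmentation_testbed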

Since we are now past the 'Finding node(s) for the job' issue, I would like to close this one and possibly open a new one for this specific issue.

cc @TaylorOshan

@jsolly jsolly closed this as completed Mar 15, 2023
@wesfloyd
Contributor

wesfloyd commented Mar 20, 2023

@dchoi27 can you share some guidance on how we can best troubleshoot the IPFS file for this issue? I'm testing on as many different machines as possible (GCE, Gitpod, local terminal) and the ipfs get command works about 50% of the time. E.g. are there any performance-testing tools that attempt to ipfs get a given CID multiple times from multiple separate geographies?

For CID: bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y

(screenshot: an example where the fetch is stalled at 50%)

Is there any way to determine whether the ipfs get inconsistency is due to a self-pinned instance, Web3.storage, some conflict between those two or otherwise?
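
Absent a dedicated tool, the simplest thing I can think of is a crude shell loop that times repeated fetches into fresh directories, e.g.:

for i in 1 2 3 4 5; do rm -rf out-$i; time ipfs get -o out-$i bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y; done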

@wesfloyd wesfloyd reopened this Mar 20, 2023
@dchoi27

dchoi27 commented Mar 20, 2023

Hm - are you peering the IPFS node with the web3.storage infra? I was just able to fetch in about 1-2 min.

In general, if you leave content discovery to the network, things can really slow down outside the control of the hosting node operator (since you're leaving things up to the mercy of the network). Peering is the easiest way to shortcut this.
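
For Kubo, peering is just a config change followed by a daemon restart, e.g. (the peer ID and multiaddr below are placeholders; the current web3.storage peer list is published in our docs):

ipfs config --json Peering.Peers '[{"ID": "<web3.storage-peer-id>", "Addrs": ["<multiaddr>"]}]'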
