Bacalhau run stuck on 'Finding node(s) for the job' #2033
Comments
There seems to be an issue with our latest CLI where it defaults to our development endpoint instead of production. Setting …
To clarify, the job is now failing because it times out even after increasing …
I followed @wesfloyd's advice and added the file to web3.storage. I do notice it still says … cc @TaylorOshan
@jsolly checking back to see if Web3.storage is working alright for you now? We can help with the pinning effort or find another pinning service if you're having issues.
Web3.storage is still pinning the resource. I will give it until Monday and if it isn't pinned, I might need to find another pinning service.
Hey folks - I'm from the web3.storage team. Re:
So that's not exactly true. As long as your upload completed, the content is available on IPFS. Have you tested reading the data off the network (either via an IPFS node or an HTTP gateway)? Sorry about the confusion - we only report a "pinning status" because that's what people are used to with Kubo nodes. We artificially mimic what a Kubo node does by traversing the entire graph of uploaded data to make sure it's a complete graph. It's a super inefficient process to do while the data itself is being uploaded, and adds a lot of latency, so we do it asynchronously. However, for large graphs of data, the job that traverses the graph can time out, and the dashboard reports this "pinning" state indefinitely. But if you're able to access your content via IPFS or HTTP gateways, your content was fully uploaded.
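The availability check suggested above can be sketched as follows. The CID is the one discussed later in this thread; `dweb.link` is a public IPFS HTTP gateway, and the `curl`/`ipfs` commands are shown as comments since they require network access:

```shell
# Check whether uploaded content is actually retrievable from IPFS,
# independent of the web3.storage dashboard's "pinning" status.
CID="bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y"
GATEWAY_URL="https://${CID}.ipfs.dweb.link/"
echo "$GATEWAY_URL"
# A successful HEAD request (HTTP 200) means the upload completed:
#   curl -sI "$GATEWAY_URL" | head -n 1
# Or fetch the data through a local IPFS node:
#   ipfs get "$CID"
```

If either the gateway request or `ipfs get` succeeds, the upload is complete regardless of what the pinning-status field says.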
FWIW - we're moving away from reporting this status at all once we move to our new upload API (currently in beta - you're welcome to try it out!) https://blog.web3.storage/posts/w3up-beta-launch
Thanks for the message @dchoi27! I do think that's a pretty bad user experience, especially for people not from a Kubo background (me). You mentioned that for large graphs of data the traversal job can time out, but I am still seeing this indefinite pinning status with files that are ~15MB. Is that considered a 'large graph of data'? I have files that are ~400MB that made it to the 'complete' status, so perhaps something else is going on? I'll definitely check out this Upload API! Thanks for sharing.
I think I need to find another pinning service. I am not able to download the file through the browser link: … And when trying …
I tried the migration and ran into a different error: …
The graph size is correlated with, but is not exactly, the size of the data (you could have a lot of tiny blocks). In any case, this isn't an issue with the new API at all, and we no longer report "pinning status" in the new API (since it has minimal benefit to report as a hosted service provider, and is really unscalable to report). But I was mistaken that that was the issue - for the upload corresponding to … For …
I think this is a Kubo/Bacalhau thing - unfortunately I don't think we or any other IPFS provider / pinning service can help (maybe @wesfloyd has an idea). In any case, if you do try out other hosted IPFS providers, I'd be curious about your experience there - we've designed web3.storage specifically to avoid the scale problems you see with other pinning providers (have a blog post on this here https://blog.web3.storage/posts/web3-storage-architecture). Of course, there still can be annoying UX kinks as we try to move the community forward to a more scalable/performant place (talk less about pinning, more about CAR files), but I think we're on the right track.
Fair enough @dchoi27! I am down to use the new uploader. In the meantime, I did re-upload the file to web3.storage and that CID now shows as 'complete'. I am able to download the file at: …

My IPFS CLI is still broken, but I am pretty sure the file is on IPFS now since I can access it at the address above in a browser. I tried running the Bacalhau job again and it failed with a timeout error: … @wesfloyd can you try …? I am suspecting there is still a download issue from IPFS even though it seems like the upload via web3.storage was successful.

@dchoi27 I see there are a lot of different ways to use this new beta uploader. I would like to use a CLI. I think this is the one I should be using: https://github.com/web3-storage/w3cli. Is that right? cc @TaylorOshan
Yes the CLI is the recommended one! Though keep your eyes peeled for a new version coming out either late this week or early next week - there will be some major UX upgrades (for instance, during the beta period new uploads won't show up in the web3.storage web app, but we'll have a beta web app specifically for w3up once the upgrade comes out) |
I can confirm the file can be downloaded via IPFS. Testing the Bacalhau command now, it appears to continue running after 10 mins, which is surprising, and the job state sits in "Bid Accepted".

@jsolly, for our reference - do you have any guidance you could share on how long this job took to run on your local machine? I'll bring this up with our engineers now via the #bacalhau Slack channel.
After re-testing today I was able to get the job to run after 50s, but the output displayed an error: … So I updated the Dockerfile to manually copy the segmentation script and sent a PR. The job is still running without ending; troubleshooting now.
@wesfloyd it takes about 1 minute to run on my M1 MacBook. When running in Docker, I did need to bump the RAM allocated to the container to 12GB instead of 8GB, or else the job would fail. Since we are now past the 'Finding node(s) for the job' issue, I would like to close this one and possibly open a new one for this specific issue. cc @TaylorOshan
@dchoi27 can you share some guidance on how we can best troubleshoot the IPFS file for this issue? I'm testing on as many different machines as possible (GCE, gitpod, local terminal) and the … For CID: bafybeiblcnj6z4pkqmfxi7jxjvkaxue2kw5xxsfhdzwyjfe23vnhvukr7y (here's an example where it is stalled at 50%). Is there any way to determine whether the ipfs get inconsistency is due to a self-pinned instance, Web3.storage, some conflict between the two, or otherwise?
Hm - are you peering the IPFS node with the web3.storage infra? I was just able to fetch in about 1-2 min. In general, if you leave content discovery to the network, things can really slow down outside the control of the hosting node operator (since you're leaving things up to the mercy of the network). Peering is the easiest way to shortcut this. |
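The peering suggested above is set in Kubo's config file (typically `~/.ipfs/config`). A minimal sketch of the relevant fragment - the peer ID and address below are placeholders, not real web3.storage values; the current peering list should be taken from the web3.storage documentation:

```json
{
  "Peering": {
    "Peers": [
      {
        "ID": "<web3.storage-peer-id>",
        "Addrs": ["/dns4/<web3.storage-peer-host>/tcp/4001"]
      }
    ]
  }
}
```

With `Peering.Peers` set, the node keeps a direct connection to the hosting infrastructure instead of relying on DHT content discovery, which is what makes fetches slow and unpredictable.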
Context
When attempting to run a Docker image on Bacalhau, it gets stuck on the 'Finding node(s) for the job' step. Is this a temporary outage, or is there a different issue?
Steps to Reproduce
It's been going for over 20 mins now. I will let it go overnight and see what happens 🤷♂️
Thoughts
I haven't tried other images.
I am on Arm64 architecture, but I built the image using:
```shell
docker buildx build --platform linux/amd64 -t segmentation_testbed .
```
and I confirmed the architecture in DockerHub.
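The architecture can also be confirmed locally, without checking DockerHub. A sketch, with the `docker` commands shown as comments since they need a running Docker daemon; the image tag matches the build command above:

```shell
# Cross-building on an arm64 host for amd64 execution nodes, as above:
#   docker buildx build --platform linux/amd64 -t segmentation_testbed .
# Then inspect what platform the local image actually targets:
#   docker image inspect --format '{{.Os}}/{{.Architecture}}' segmentation_testbed
# For the build above, inspect should report this platform string:
EXPECTED_PLATFORM="linux/amd64"
echo "$EXPECTED_PLATFORM"
```

If inspect reports `linux/arm64` instead, the image was built for the host architecture and amd64 compute nodes will not be able to run it.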
Environment
Client Version: v0.3.22
Server Version: v0.3.22
Link to repository -> Segmentation Testbed