Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected error messages after upgrading from 0.17.0 to 0.21.0 #454

Open
bajtos opened this issue Oct 24, 2023 · 3 comments
Open

Unexpected error messages after upgrading from 0.17.0 to 0.21.0 #454

bajtos opened this issue Oct 24, 2023 · 3 comments
Assignees

Comments

@bajtos
Copy link
Contributor

bajtos commented Oct 24, 2023

While upgrading rusty-lassie from Lassie 0.17.0 to 0.21.0 (see filecoin-station/rusty-lassie#68), I noticed new error messages in the test output.

These messages seem to be printed only when the retrieval is aborted via "max blocks" or "global timeout" feature.

Is this a known issue? How can I get rid of these messages?


The messages are repeated many times in my test output; I am posting one example for each. I also changed the formatting to make things easier to read.

2023-10-24T11:04:22.574+0200	ERROR	bitswap
providerquerymanager/providerquerymanager.go:344	
Received provider (12D3KooWNy2SmgY384B9EHVHdbSkvXEADwxjzd4ejXQqicxWpxxT) for cid (Qme3KPUshfASLi2kn78dr69hsNsW6DnpnFHnbBHyECiTjp) not requested

2023-10-24T11:04:23.191+0200	ERROR	lassie/retriever	
retriever/parallelpeerretriever.go:277	
Error recording retrieval failure for protocol transport-ipfs-gateway-http: internal error, no such active retrieval 80e3c2b4-0797-4bfc-bc72-80ee3df5fb90

cc @hannahhoward @rvagg

@rvagg
Copy link
Member

rvagg commented Oct 25, 2023

Thanks for the log output @bajtos; this is likely related to some context handling done across all the parallel workloads where cancellation needs to interrupt it all and there's been a few changes in that version range that adjust the timing subtly such that these errors may pop up. I haven't seen these particular ones but I have been bothered by some odd "context cancelled" log messages that shouldn't be seen!

What I can say is that neither of these errors is fatal, they're more of the type of "I probably shouldn't get in this state". They're annoying and they point to something that should be addressed, but I'm pretty confident they're just related to the retrieval bail state - the timing of "shut it all down" not hitting all the right things in the right order, but that doesn't indicate clean-up problems.

The first error is bitswap - we got a peer from the indexer and fed it in to bitswap but at the same time we also told bitswap to stop bothering with this CID.

The second error is from the SP scoring mechanism—keeping track of how SP's perform so we can make better choices about prioritising. "you told me a candidate failed a retrieval, but you already told me that retrieval had ended".

I'll look in to it and see if I can replicate it these.

@rvagg rvagg self-assigned this Oct 25, 2023
@bajtos
Copy link
Contributor Author

bajtos commented Oct 25, 2023

Thank you, @rvagg, for the detailed explanation! ❤️

I also assumed these errors were not fatal, so I am not worried. I agree with you these logs are annoying; that's why I opened this issue.

I can reliably reproduce this issue on my machine when I run cargo test in filecoin-station/rusty-lassie@5b0f37c

❯ cargo version
cargo 1.72.0 (103a7ff2e 2023-08-15)
❯ rustc --version
rustc 1.72.0 (5680fa18f 2023-08-23)
❯ go version
go version go1.20.4 darwin/arm64

@rvagg
Copy link
Member

rvagg commented Nov 27, 2023

(a note for myself that I started tinkering with test cases to try and capture this in my local branch rvagg/max-block-limit-probs-tmp, didn't make enough progress to share anything yet though, just don't want to forget what that branch is for!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants