[Crawler] Sometimes gets stuck after navigating to the page and before extracting metadata #93
Comments
Can you try updating your 'HOARDER_VERSION' to 'latest'? I've improved the error logging, so we'll be able to see exactly why it's failing. Please update to latest and report back with the error message :)
Thanks! This is the 2nd time I'm seeing this issue, so I'll assume it's not a one-off and that it's a bug. I'll try to debug it to understand why it happens. Thanks for the report!
I tried bookmarking a few different pages on GitHub, so I wonder if the user agent or something similar is being blocked. I've also seen it a few times, but inconsistently, with YouTube. I've mostly been trying this out by bookmarking things I want to come back to later, and it's been pretty random. I can do a bit more testing, though, to see if I can find a pattern.
The thing is, in this instance and in the previous report, it's not reproducible for me. I have yet to find a website that reproduces it on my server. However, with the new log lines, I know exactly where it happens: either when fetching the page content or when closing the browser context. I'll need to dig deeper to understand why either of them can get stuck.
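While the root cause is unknown, one generic way to keep a single stuck step from blocking a crawl is to race it against a timer. Here's a minimal sketch of that idea. It is not Hoarder's actual code: `withTimeout`, `TimeoutError`, and the labels are hypothetical names, and a timeout only turns a hang into a visible error, it doesn't explain why the step got stuck.

```typescript
// Hypothetical helper: fail a step that hangs instead of waiting forever.
class TimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TimeoutError";
  }
}

async function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  label: string,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // This promise never resolves; it only rejects once the deadline passes.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(`${label} timed out after ${ms} ms`)),
      ms,
    );
  });
  try {
    // Whichever settles first wins: the real work or the deadline.
    return await Promise.race([work, deadline]);
  } finally {
    // Always clear the timer so it doesn't keep the process alive.
    clearTimeout(timer);
  }
}
```

A caller would wrap each suspect step separately (for example the page-content fetch and the browser-context close, each with its own label), so the log shows which one hit the deadline. Note that the underlying operation keeps running after the timeout fires; the caller just stops waiting for it.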
Let me know how I can help.
Thanks a lot. Worst case, I might add some more debugging lines and ask you to try to reproduce it again :)
If it helps, the Docker version is 25.0.4 and the output of uname -a:
While we're at it, can you also share the logs of the browser container? Maybe there's something interesting there.
I'm assuming that's the chrome container.
I just confirmed that the host Hoarder is running on isn't being blocked by GitHub. I was able to load the page via Ladder, another tool I have running in Docker on the same host.
OK, maybe another idea. Can you try restarting the chrome container itself and see if it helps?
Same.
I was having the same issue for Reddit links yesterday (getting stuck before metadata extraction), but I pasted the same link today and it worked fine.
It's a very weird problem and so far I couldn't reproduce it locally 😔
I just updated my Hoarder again and refreshed the link. It loaded/populated this time.
The problem seems transient: when it happens, it gets stuck for a while and then resolves on its own. I'm still trying to figure out what this "state" is that we get stuck in for some websites.
Guys, is there anyone still facing this issue? If yes, can you try adding |
I haven't encountered the problem since the last time I posted in this thread.
I too was having issues with scraping sites, predominantly Reddit. Adding the suggested flag did allow the worker to successfully crawl all of the bookmarks that it was failing on previously. I'll see whether it continues to be successful as I add new Reddit links. For reference, here is my compose for the chrome container:
OK, I'll consider this completed for now and we can always reopen if someone faces the issue again.
I'm seeing this error in my logs:
For the sites where it reports this, the bookmark doesn't get updated (no thumbnail, title, tags, etc.). I've tried to refresh the item, but it always reports the same error (nothing between the curly braces).
An example URL would be https://github.com/filebrowser/filebrowser
Given the lack of actionable info in the log, I'm not sure how to proceed in troubleshooting.
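An error that prints as empty curly braces is consistent with a common JavaScript pitfall: the `message` and `stack` of an `Error` are non-enumerable, so `JSON.stringify` on the error object yields `{}`. Whether that is what produced this particular log line is an assumption, since the log itself isn't shown. A minimal sketch of the pitfall and one way around it (`toPlainError` and the sample message are hypothetical, not taken from Hoarder):

```typescript
// Hypothetical shape for a loggable error.
interface PlainError {
  name: string;
  message: string;
  stack: string | undefined;
}

// Copy the useful fields of a thrown value into a plain object,
// so that JSON-based loggers can actually print them.
function toPlainError(err: unknown): PlainError {
  if (err instanceof Error) {
    return { name: err.name, message: err.message, stack: err.stack };
  }
  // Anything can be thrown in JavaScript, so handle non-Error values too.
  return { name: "NonError", message: String(err), stack: undefined };
}

const failure = new Error("navigation timed out"); // made-up sample error

// Error fields are non-enumerable, so this serializes to "{}".
const naive = JSON.stringify(failure);

// Copying the fields first keeps the message (and stack) in the output.
const useful = JSON.stringify(toPlainError(failure));

console.log(naive);
console.log(useful);
```

If the log does come from serializing an error this way, logging the copied fields instead would at least surface the message and stack trace needed to troubleshoot.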