CheerioCrawler's gzip decompression fails on certain pages #266
Comments
Still not fixed, but at least it led me to a bug in … The trouble is that the … It could be fixed by retrying the request with …
nodejs/node#7502

Not sure if it's even possible to do it anymore. But let's see.
You don't need to go far: it happens with Node and the log API endpoint. I was having this exact problem when providing a manually crafted …
cc @szmarczak, looks like it's the same with got. Maybe you have some ideas?
The website improperly compressed the data; however, it's still possible to decode it. This is against the spec, so it won't be fixed in Got. The solution is to get the buffer / stream and decode it with …
Closing: either it's somehow fixed in crawlee or one of the dependencies, or that URL no longer handles compression wrongly. Works fine with cheerio (got-scraping) as well as playwright in crawlee.
For example, when you try to crawl
https://www.ebay.com/sch/sis.html?_nkw=Beechcraft
it consistently fails with: …

When you try to open the same page in Firefox and set
Accept-Encoding: gzip
the page loads fine.