CheerioCrawler's gzip decompression fails on certain pages #266

Closed
jancurn opened this issue Dec 24, 2018 · 7 comments
Labels
bug Something isn't working.

Comments

jancurn (Member) commented Dec 24, 2018

For example, when you try to crawl https://www.ebay.com/sch/sis.html?_nkw=Beechcraft it consistently fails with:

ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.ebay.com/sch/i.html?_nkw=dodge","retryCount":1}
  Error: incorrect header check
    at Gunzip.zlibOnError (zlib.js:153:15)

When you open the same page in Firefox with Accept-Encoding: gzip set, the page loads fine.
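
For context, here is a minimal repro sketch using plain Node, assuming the server still mislabels its encoding (it may have been fixed since) and that piping the body through zlib's gunzip roughly matches what the crawler's HTTP layer did at the time:

```js
// Sketch only: request the page with Accept-Encoding: gzip and gunzip the body.
const https = require('https');
const zlib = require('zlib');

https.get(
    'https://www.ebay.com/sch/sis.html?_nkw=Beechcraft',
    { headers: { 'accept-encoding': 'gzip' } },
    (res) => {
        const gunzip = zlib.createGunzip();
        res.pipe(gunzip);
        gunzip.on('error', (err) => console.error(err.message)); // "incorrect header check"
        gunzip.resume(); // drain the output; we only care about the error here
    },
);
```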

mnmkng (Member) commented Jun 6, 2020

Still not fixed, but at least it led me to a bug in http-request: we were swallowing these errors, so requests were timing out instead of throwing.

The trouble is that the incorrect header check error pops up in the middle of the response body. I tried all of the zlib decompress functions and none of them worked.

It could be fixed by retrying the request with Accept-Encoding: identity, but then we would lose the browser-like headers in requestAsBrowser.
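
For illustration, a sketch of that fallback with plain Node (the requestAsBrowser plumbing is left out, so treat the names here as hypothetical):

```js
// Fallback sketch: ask the server not to compress at all. As noted above,
// the downside is that this retry no longer carries the browser-like headers
// that requestAsBrowser normally sends.
const https = require('https');

function fetchUncompressed(url) {
    return new Promise((resolve, reject) => {
        https
            .get(url, { headers: { 'accept-encoding': 'identity' } }, (res) => {
                const chunks = [];
                res.on('data', (chunk) => chunks.push(chunk));
                res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
                res.on('error', reject);
            })
            .on('error', reject);
    });
}
```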

mnmkng (Member) commented Jun 6, 2020

See nodejs/node#7502. Not sure if it's even possible to do this anymore, but let's see.

pocesar (Contributor) commented Jun 6, 2020

You don't need to go far; it happens with Node and the log API endpoint. I was having this exact problem when providing a manually crafted Range header. It works in the browser but fails in Node, and I tried a lot of request libraries as well. The only way to get past the internal incorrect header check is to forge a zlib header and try again, and even then the data still comes out corrupt.
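
For reference, the header-forging trick usually amounts to prepending a standard zlib header to the raw payload before inflating; a sketch is below, with the caveat from the comment above that the result can still come out corrupt:

```js
// "Forge a zlib header" workaround, sketch only. Prepending 0x78 0x9c gets
// past the "incorrect header check", but the decoded data may still be
// corrupt, and the missing adler32 trailer typically makes inflate error
// once it reaches the end of the stream.
const zlib = require('zlib');

function inflateWithForgedHeader(rawBody) {
    const forged = Buffer.concat([Buffer.from([0x78, 0x9c]), rawBody]);
    const inflate = zlib.createInflate();
    inflate.end(forged);
    return inflate; // readable stream of decompressed data; expect an 'error' event
}
```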

B4nan (Member) commented Sep 10, 2021

cc @szmarczak, it looks like it's the same with got, maybe you have some ideas?

incorrect header check
RequestError: incorrect header check
    at PassThrough.<anonymous> (/Users/adamek/htdocs/apify/apify-js/node_modules/got-cjs/dist/source/core/index.js:614:31)
    at Object.onceWrapper (node:events:476:26)
    at PassThrough.emit (node:events:381:22)
    at emitErrorNT (node:internal/streams/destroy:188:8)
    at emitErrorCloseNT (node:internal/streams/destroy:153:3)
    at processTicksAndRejections (node:internal/process/task_queues:81:21)
    at Zlib.zlibOnError [as onerror] (node:zlib:190:17)

szmarczak (Contributor) commented

The website improperly compressed the data; however, it's still possible to decode it. This is against the spec, so it won't be fixed in Got. The solution is to get the buffer / stream and decode it with DeflateRaw instead.
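
A minimal sketch of that approach, assuming you can obtain the compressed response body as a Buffer or stream (for example by turning off automatic decompression in your HTTP client):

```js
// Decode a mislabeled body as raw DEFLATE, per the suggestion above.
const zlib = require('zlib');

// Buffer variant: works when the payload is a bare DEFLATE stream without
// gzip/zlib framing.
function decodeRawDeflate(compressedBody) {
    return zlib.inflateRawSync(compressedBody).toString('utf8');
}

// Stream variant:
// responseStream.pipe(zlib.createInflateRaw()).pipe(destination);
```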

B4nan (Member) commented Jul 15, 2022

Closing; either it's somehow fixed in Crawlee or one of the dependencies, or that URL no longer handles compression wrongly.

It works fine with Cheerio (got-scraping) as well as Playwright in Crawlee.

B4nan closed this as completed Jul 15, 2022
