CheerioCrawler's gzip decompression fails on certain pages #266

Closed
jancurn opened this issue Dec 24, 2018 · 7 comments
Labels
bug Something isn't working.

Comments

jancurn (Member) commented Dec 24, 2018

For example, when you try to crawl https://www.ebay.com/sch/sis.html?_nkw=Beechcraft it consistently fails with:

ERROR: BasicCrawler: handleRequestFunction failed, reclaiming failed request back to the list or queue {"url":"https://www.ebay.com/sch/i.html?_nkw=dodge","retryCount":1}
  Error: incorrect header check
    at Gunzip.zlibOnError (zlib.js:153:15)

When you open the same page in Firefox with Accept-Encoding: gzip set, the page loads fine.
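
For context, here is a minimal repro sketch using plain Node, assuming the server still mislabels its encoding (it may have been fixed since) and that piping the body through zlib's gunzip roughly matches what the crawler's HTTP layer did at the time:

```js
// Sketch only: request the page with Accept-Encoding: gzip and gunzip the body.
const https = require('https');
const zlib = require('zlib');

https.get(
    'https://www.ebay.com/sch/sis.html?_nkw=Beechcraft',
    { headers: { 'accept-encoding': 'gzip' } },
    (res) => {
        const gunzip = zlib.createGunzip();
        res.pipe(gunzip);
        gunzip.on('error', (err) => console.error(err.message)); // "incorrect header check"
        gunzip.resume(); // drain the output; we only care about the error here
    },
);
```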

mnmkng (Member) commented Jun 6, 2020

Still not fixed, but at least it led me to a bug in http-request: we were swallowing these errors, so requests were timing out instead of throwing.

The trouble is that the incorrect header check error pops up in the middle of the response body. I tried all of the zlib decompress functions and none of them worked.

It could be fixed by retrying the request with Accept-Encoding: identity, but then we would lose the browser-like headers in requestAsBrowser.
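
For illustration, a sketch of that fallback with plain Node (the requestAsBrowser plumbing is left out, so treat the names here as hypothetical):

```js
// Fallback sketch: ask the server not to compress at all. As noted above,
// the downside is that this retry no longer carries the browser-like headers
// that requestAsBrowser normally sends.
const https = require('https');

function fetchUncompressed(url) {
    return new Promise((resolve, reject) => {
        https
            .get(url, { headers: { 'accept-encoding': 'identity' } }, (res) => {
                const chunks = [];
                res.on('data', (chunk) => chunks.push(chunk));
                res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
                res.on('error', reject);
            })
            .on('error', reject);
    });
}
```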

mnmkng (Member) commented Jun 6, 2020

See nodejs/node#7502. Not sure if it's even possible to do this anymore, but let's see.

pocesar (Contributor) commented Jun 6, 2020

You don't need to go far; it happens with Node and the log API endpoint. I was having this exact problem when providing a manually crafted Range header. It works in the browser but fails in Node, and I tried a lot of request libraries as well. The only way to get past the internal incorrect header check is to forge a zlib header and try again, and even then the data still comes out corrupt.
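
For reference, the header-forging trick usually amounts to prepending a standard zlib header to the raw payload before inflating; a sketch is below, with the caveat from the comment above that the result can still come out corrupt:

```js
// "Forge a zlib header" workaround, sketch only. Prepending 0x78 0x9c gets
// past the "incorrect header check", but the decoded data may still be
// corrupt, and the missing adler32 trailer typically makes inflate error
// once it reaches the end of the stream.
const zlib = require('zlib');

function inflateWithForgedHeader(rawBody) {
    const forged = Buffer.concat([Buffer.from([0x78, 0x9c]), rawBody]);
    const inflate = zlib.createInflate();
    inflate.end(forged);
    return inflate; // readable stream of decompressed data; expect an 'error' event
}
```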

B4nan (Member) commented Sep 10, 2021

cc @szmarczak, it looks like it's the same with got, maybe you have some ideas?

incorrect header check
RequestError: incorrect header check
    at PassThrough.<anonymous> (/Users/adamek/htdocs/apify/apify-js/node_modules/got-cjs/dist/source/core/index.js:614:31)
    at Object.onceWrapper (node:events:476:26)
    at PassThrough.emit (node:events:381:22)
    at emitErrorNT (node:internal/streams/destroy:188:8)
    at emitErrorCloseNT (node:internal/streams/destroy:153:3)
    at processTicksAndRejections (node:internal/process/task_queues:81:21)
    at Zlib.zlibOnError [as onerror] (node:zlib:190:17)

szmarczak (Contributor) commented

The website improperly compressed the data; however, it's still possible to decode it. This is against the spec, so it won't be fixed in Got. The solution is to get the buffer / stream and decode it with DeflateRaw instead.
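
A minimal sketch of that approach, assuming you can obtain the compressed response body as a Buffer or stream (for example by turning off automatic decompression in your HTTP client):

```js
// Decode a mislabeled body as raw DEFLATE, per the suggestion above.
const zlib = require('zlib');

// Buffer variant: works when the payload is a bare DEFLATE stream without
// gzip/zlib framing.
function decodeRawDeflate(compressedBody) {
    return zlib.inflateRawSync(compressedBody).toString('utf8');
}

// Stream variant:
// responseStream.pipe(zlib.createInflateRaw()).pipe(destination);
```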

B4nan (Member) commented Jul 15, 2022

Closing; either it's somehow fixed in Crawlee or one of the dependencies, or that URL no longer handles compression wrongly.

It works fine with Cheerio (got-scraping) as well as Playwright in Crawlee.

B4nan closed this as completed Jul 15, 2022
