[Help needed] file size mismatch #2938

Closed
lx30011 opened this issue Sep 19, 2022 · 5 comments

Comments

@lx30011
Contributor

lx30011 commented Sep 19, 2022

I'm running into this issue where gallery-dl occasionally chokes on partial content (code 206). It outputs "file size mismatch" after each of the five tries and then gives up. I read the HttpDownloader source but I'm not getting any wiser. Would appreciate your help with debugging.

[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 200 11982943
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (1825585 < 11982943) (1/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 10157358
[downloader.http][debug] Resuming download at byte 1825585
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (1966080 < 11982943) (2/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 10016863
[downloader.http][debug] Resuming download at byte 1966080
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (2097152 < 11982943) (3/5)
[urllib3.connectionpool][debug] Resetting dropped connection: 8chan.moe
[urllib3.connectionpool][debug] https://8chan.moe:443 "GET /.media/759a3119a0a06e717c704de7f9dcc8916dc3f8a8039531a399a5982e847e7770.png HTTP/1.1" 206 9885791
[downloader.http][debug] Resuming download at byte 2097152
./8chan/v/640607 Meta Thread/678236-2 fug.mp4_20220827_142327.941.png
[downloader.http][warning] file size mismatch (2228224 < 11982943) (4/5)
^C
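
The numbers in the log can be cross-checked by hand. The sketch below is only an illustration, with values copied from the log lines above; the names `TOTAL`, `RESUMES` and `SIZES_ON_DISK` are made up for this example. For every retry, the resume offset plus the length of the 206 response equals the full size from the initial 200 response, which suggests the server answers the Range request correctly and the shortfall comes from the connection being dropped mid-transfer.

```python
# Full size reported by the first (200) response in the log.
TOTAL = 11982943

# (resume offset, Content-Length of the 206 response), copied from the log.
RESUMES = [
    (1825585, 10157358),
    (1966080, 10016863),
    (2097152, 9885791),
]

# File size on disk reported by each "file size mismatch" warning.
SIZES_ON_DISK = [1825585, 1966080, 2097152, 2228224]


def remaining_matches(total, offset, content_length):
    """True if a 206 response covers exactly the bytes still missing."""
    return offset + content_length == total


def bytes_gained(sizes):
    """Bytes added to the file by each retry."""
    return [later - earlier for earlier, later in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    for offset, length in RESUMES:
        print(offset, length, remaining_matches(TOTAL, offset, length))
    print(bytes_gained(SIZES_ON_DISK))
```

In this excerpt every resume lines up with the total, and the last two retries each added exactly 131072 bytes (128 KiB) before the connection dropped again.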

The site is https://8chan.moe/
I was testing using this thread (mostly SFW): https://8chan.moe/v/res/640607.html
I think the site serves all somewhat large files in chunks; at least, all the files it choked on were larger ones.
This thread has a lot of large files (mostly SFW): https://8chan.moe/v/res/673938.html
If you run the extractor I recommend setting sleep to 1 to make sure you don't get rate limited.
You can let the extractor run and it should choke on some file.
Here is my extractor:

# -*- coding: utf-8 -*-

# Copyright 2017-2021 Mike Fährmann
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License version 2 as
# published by the Free Software Foundation.

from .common import Extractor, Message
from .. import text
import itertools

class _8chanThreadExtractor(Extractor):
  """Extractor for 8chan threads"""
  category = "8chan"
  subcategory = "thread"
  directory_fmt = ("{category}", "{boardUri}",
                   "{threadId} {subject[:50]}")
  filename_fmt = "{postId}{num:?-//} {filename[:80]}.{extension}"
  archive_fmt = "{boardUri}_{postId}_{num}"
  pattern = r"(?:https?://)?8chan\.moe/([^/]+)/res/(\d+)"

  def __init__(self, match):
    Extractor.__init__(self, match)
    self.board, self.thread = match.groups()

  def items(self):
    self.request("https://8chan.moe/")
    dust = "https://8chan.moe/{}/".format(self.board)
    self.request(dust, headers={"Referer": "https://8chan.moe/"})
    url = "https://8chan.moe/{}/res/{}.json".format(self.board, self.thread)
    thread = self.request(url, headers={"Referer": dust}).json()
    thread["postId"] = thread["threadId"]
    posts = thread.pop("posts")

    yield Message.Directory, thread

    for post in itertools.chain((thread,), posts):
      files = post.pop("files", ())
      if files:
        thread.update(post)
        for num, file in enumerate(files):
          file.update(thread)
          file["num"] = num
          file["_http_headers"] = {
            "Referer": "https://8chan.moe/{}/res/{}.html".format(self.board, self.thread)
          }
          url = "https://8chan.moe" + file["path"]
          text.nameext_from_url(file["originalName"], file)
          yield Message.Url, url, file
@lx30011
Contributor Author

lx30011 commented Sep 19, 2022

Upon further investigation, this doesn't seem to be an issue with gallery-dl, but rather with the site.
Using wget, I'd get "Read error at byte 13893632/22135554 (The TLS connection was non-properly terminated.). Retrying." with the same frequency as with gallery-dl.
The error message led me to varnish/hitch#127. 8chan uses Varnish, so it might be relevant.

Once I let gallery-dl retry, it eventually succeeds in downloading the files. They look fine and the videos play back fine; I'm just wondering whether there might be some corruption. I can't tell.

@mikf
Owner

mikf commented Sep 20, 2022

I tried several things (--sleep 5, -o browser=firefox, exactly matching browser headers except cookies) and none of them worked. Sooner or later it would always end up with a [downloader.http][warning] file size mismatch.

What does work is sending captchaid and captchaexpiration cookies exported from a browser.
captchaexpiration is easy enough to replicate, but it might be impossible to automatically generate a captchaid.

@lx30011
Contributor Author

lx30011 commented Sep 20, 2022

Thank you very much for testing. If you look at my extractor, I'm requesting pages in a similar manner as to how a user would (root of the site, then board, then thread) with the goal of having the cookies generate. I noticed gallery-dl uses requests.Session underneath, which stores cookies obtained during a session. Going by your response, though, I presume that doesn't work.

Just to make sure I understand, do you think captchaid and captchaexpiration have anything to do with the file size mismatch error? If the cookies are set, there's no file size mismatch? The cookie has a short expiry date of three minutes so it would need enough runs to be sure that it eliminates the issue.

@mikf
Owner

mikf commented Sep 20, 2022

I think I figured it out. We can get both cookies by sending a request to https://8chan.moe/captcha.js with Accept: image/avif,image/webp,*/* and all the other usual headers (use your browser dev tools). This redirects to https://8chan.moe/.global/captchas/123456789e8c6a652e68e225 and sets both captchaid and captchaexpiration.
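
A minimal sketch of that request, using only the Python standard library rather than gallery-dl's own session. The helper names and the placeholder User-Agent are assumptions made for this example, and only two headers are set here, so the server may well require the full set of browser headers mentioned above before it issues the cookies.

```python
import http.cookiejar
import urllib.request

CAPTCHA_URL = "https://8chan.moe/captcha.js"


def build_captcha_request(user_agent="Mozilla/5.0"):
    """Build the GET request with the Accept header described above.

    The User-Agent value is a placeholder (assumption); a real browser
    string and further headers may be needed.
    """
    headers = {
        "Accept": "image/avif,image/webp,*/*",
        "User-Agent": user_agent,
    }
    return urllib.request.Request(CAPTCHA_URL, headers=headers)


def fetch_captcha_cookies(jar=None):
    """Send the request, follow the redirect, and return the two cookies.

    Performs network I/O; cookies set along the way end up in `jar`.
    """
    if jar is None:
        jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    with opener.open(build_captcha_request(), timeout=30) as response:
        response.read()  # the body (the captcha image) is not needed
    wanted = ("captchaid", "captchaexpiration")
    return {c.name: c.value for c in jar if c.name in wanted}
```

Calling `fetch_captcha_cookies()` should return a dict with both cookie names if the site behaves as described; an empty dict would mean the cookies were not set for this request.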

Afterwards those cookies would either need to be refreshed every few minutes, or maybe it would also work to unset or extend the expires value of captchaid.

> Just to make sure I understand, do you think captchaid and captchaexpiration have anything to do with the file size mismatch error?

Yeah, they do. Not sending them causes 8chan to periodically drop the connection during downloads and we get file size mismatch errors, which does not happen when they are included in a request. Just try it out.

> I'm requesting pages in a similar manner as to how a user would (root of the site, then board, then thread) with the goal of having the cookies generate

I think that's unnecessary. Just remove those two requests.


edit: Using a captchaid cookie that is several hours old still works, so just removing the expires value from it is the best option, I think.
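
One way to remove the expires value, sketched with the standard library's `http.cookiejar` (the jar used by requests.Session is a subclass of `CookieJar`, so the same attribute change should apply there). `make_cookie` and `drop_expiry` are hypothetical helpers written for this example, not gallery-dl functions.

```python
import http.cookiejar
import time


def make_cookie(name, value, expires):
    """Create a cookie for 8chan.moe (helper for this example only)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain="8chan.moe", domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=True, expires=expires, discard=expires is None,
        comment=None, comment_url=None, rest={},
    )


def drop_expiry(jar, name):
    """Turn every cookie called `name` into a session cookie.

    Returns the number of cookies changed.
    """
    changed = 0
    for cookie in jar:
        if cookie.name == name:
            cookie.expires = None   # no expiry: never counts as expired
            cookie.discard = True   # mark as session-only
            changed += 1
    return changed


if __name__ == "__main__":
    jar = http.cookiejar.CookieJar()
    # Pretend the cookie expired a minute ago.
    jar.set_cookie(make_cookie("captchaid", "example", int(time.time()) - 60))
    drop_expiry(jar, "captchaid")
    jar.clear_expired_cookies()
    print([c.name for c in jar])
```

Without the `drop_expiry` call, `clear_expired_cookies()` would remove the cookie from the jar; with it, the cookie stays and keeps being sent.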

edit2: For reference, I'm using these as _http_headers:

    headers = {
        "Accept": "video/webm,video/ogg,video/*;q=0.9,application/ogg;q=0.7,audio/*;q=0.6,*/*;q=0.5",
        "Accept-Language": "en-US,en;q=0.5",
        "Range": "bytes=0-",
        "DNT": "1",
        "Connection": "keep-alive",
        "Referer": "https://8chan.moe/v/res/673938.html",
        "Cookie": "captchaexpiration=Tue, 20 Sep 2022 16:29:07 GMT; captchaid=6329e99ff54ecb27cafcdbda3ZS8RplhsSLtWoxq8dLyrTvswKEgK2jE2w+u4LirJVr3qnpfsPVwpetZVMTkKHhR6BlaL/Ox9E3QB+voGG7T0A==",
        "Sec-Fetch-Dest": "video",
        "Sec-Fetch-Mode": "no-cors",
        "Sec-Fetch-Site": "same-origin",
        "TE": "trailers",
    }

@mikf
Owner

mikf commented Oct 11, 2022

I took your code from #2938 (comment), modified it quite a bit, and eventually ended up with a working extractor that does not get interrupted while downloading: 1696f68.

@mikf mikf closed this as completed Oct 11, 2022