Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using zip postprocessor creates a broken zip file if interrupted and resumed #355

Closed
valconius opened this issue Jul 26, 2019 · 1 comment

Comments

@valconius
Copy link

valconius commented Jul 26, 2019

When I'm making zip files using the postprocessor, upon resumption of some interrupted gallery-dl process, I can only view the files added after restarting gallery-dl in my zips with comic viewers because the filemodes/attributes for older files were not set before the interruption. See the end for what I presume is a fix to what is less a bug and more inconvenient and work-aroundable behavior.

I have the following postprocessor my config to create zip files and I am using the Windows binary release of gallery-dl.

"postprocessors":
[
	{
		"name": "zip",
		"compression": "store",
                "keep-files": false,
		"whitelist": ["exhentai"]
	}
]

I find a gallery and run the command, gallery-dl <gallery-url>, which produces <gallery>.zip. I allow the command to execute. Looking at the zip file's NameToInfo in a python REPL shows me everything is okay:

{
	'p0001.png': <ZipInfo filename='p0001.png' filemode='-rw-rw-rw-' file_size=1314184>, 
	'p0007.png': <ZipInfo filename='p0007.png' filemode='-rw-rw-rw-' file_size=1185571>, 
	'p0008.png': <ZipInfo filename='p0008.png' filemode='-rw-rw-rw-' file_size=1401698>, 
	'p0003.png': <ZipInfo filename='p0003.png' filemode='-rw-rw-rw-' file_size=1640022>, 
	'p0006.png': <ZipInfo filename='p0006.png' filemode='-rw-rw-rw-' file_size=1406659>, 
	'p0005.png': <ZipInfo filename='p0005.png' filemode='-rw-rw-rw-' file_size=871243>, 
	'p0004.png': <ZipInfo filename='p0004.png' filemode='-rw-rw-rw-' file_size=1261454>, 
	'p0002.png': <ZipInfo filename='p0002.png' filemode='-rw-rw-rw-' file_size=1016398>
}

Next, I delete the output from my previous run and repeat the command. However, I interrupt partway with a C-c keyboard interrupt (page 6 completes, but pages 7 and 8 are absent as expected). The zip file now has no filemode/permissions! Furthermore, the file's NameToInfo is empty at {}! A comic viewer can see everything just fine, but that is odd! I suppose something magical happens in your finalize() function in postprocessor/zip.py, but I don't know very much python so this is a mystery to me. All I know is when I do what you do in your run() function to append a file, everything works just fine; furthermore, if I make a zip file from scratch, then NameToInfo will show all its contents with appropriate filemode/attributes:

zfile = zipfile.ZipFile("comic.zip", "a", zipfile.ZIP_STORED, True)
zfile.write("p0007.png) # this will have rw filemode, unlike pages 1-6
zfile.NameToInfo        # Only contains { 'p0007.png': <...> } and nothing else
zfile.close()

So, to reproduce this:

  1. gallery-dl <gallery-url>
  2. C-c partway through
  3. gallery-dl <gallery-url>

The result is not the NameToInfo structure above, but only the subset containing the newer files for pages 7 and 8. Furthermore, comic viewers will only show/display the new files because their filemode is set but pages 1 through 6 have no filemode.

I've attached an image as a visual summary of what happened to me. In practice, substitute C-c for a loss of power, an application crash or something like that.

gallery-dl zip bug demo

I suppose in practice this would not normally be a real issue. If someone is scraping with gallery-dl then they would probably be interrupted via their config (in which case zfile.close() occurs and this issue does not apply to them), but if they did suffer some interruption otherwise, the fix is to simply manually fix the zip contents' permissions.

However, this manual intervention need could be precluded by opening and closing the zip file on every single instance that zfile.run() is called, which might seem bananas but considering the lion's share of each iteration is not spent calculating file opens and file closes, doing so would make everything much more idiot-proof and allow us to assume no manual intervention is ever required on interruptions. At least, that how things seem to me.

An example of a fix, I suppose, would be the following change in postprocessor/zip.py, or to do something like with ZipFile('spam.zip', 'a') as zfile: ... ![gallery-dl zip bug demo](https://user-images.githubusercontent.com/4463623/61985424-0b50c300-afbe-11e9-9593-0fb3656f8ae0.png) or whatever, python is really something I'm rather awful at... mode 'a' seems to work even if file doesn't exist, not that I know what I'm talking about):

    def __init__(self, pathfmt, options):
        PostProcessor.__init__(self)
        self.delete = not options.get("keep-files", False)
        self.ext = "." + options.get("extension", "zip")
        algorithm = options.get("compression", "store")
        if algorithm not in self.COMPRESSION_ALGORITHMS:
            self.log.warning(
                "unknown compression algorithm '%s'; falling back to 'store'",
                algorithm)
            algorithm = "store"

        self.path = pathfmt.realdirectory
        # DELETE THIS:
        self.zfile = zipfile.ZipFile(
            self.path + self.ext, "a",
            self.COMPRESSION_ALGORITHMS[algorithm], True)

    def run(self, pathfmt):
        # MOVE FROM ABOVE:
        #self.zfile = zipfile.ZipFile(
        #    self.path + self.ext, "a",
        #    self.COMPRESSION_ALGORITHMS[algorithm], True)

        # 'NameToInfo' is not officially documented, but it's available
        # for all supported Python versions and using it directly is a lot
        # better than calling getinfo()
        if pathfmt.filename not in self.zfile.NameToInfo:
            self.zfile.write(pathfmt.temppath, pathfmt.filename)
            pathfmt.delete = self.delete

        # MOVE FROM BELOW:
        self.zfile.close()

    def finalize(self):
        # DELETE THIS:
        #self.zfile.close()

        if self.delete:
            try:
                os.rmdir(self.path)
            except OSError:
                pass

            if not self.zfile.NameToInfo:
                try:
                    os.unlink(self.zfile.filename)
                except OSError:
                    pass
@mikf
Copy link
Owner

mikf commented Jul 27, 2019

First of all, thank you for all the work you've already put into this. Having detailed error descriptions is always nice.

The crux of the issue appears to be that the zip file index (?) only get completely written when ZipFile.close() is called and is left in an "intermediate" state between having written a file to it and actually closing it. 20f7b07 now makes sure that ZipFile.close() always get called during exceptions, KeyboardInterrupt included, but there is no way to have this happen when the Python interpreter itself is killed or a power outage happens.

Of course your with ZipFile('spam.zip', 'a') as zfile: ... suggestion would drastically reduce the chance of the program unexpectedly dying while the zip file is in an "intermediate" state, but I generally dislike opening and parsing the zip archive over and over again when doing this once is more than enough - except in this case.

There is also the possibility that the process is interrupted during a ZipFile.write() and a file is only partially written. Detecting and fixing such a case also needs to happen somehow ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants