Using zip postprocessor creates a broken zip file if interrupted and resumed #355

valconius · 2019-07-26T22:57:32Z

When I'm making zip files using the postprocessor, upon resumption of some interrupted gallery-dl process, I can only view the files added after restarting gallery-dl in my zips with comic viewers because the filemodes/attributes for older files were not set before the interruption. See the end for what I presume is a fix to what is less a bug and more inconvenient and work-aroundable behavior.

I have the following postprocessor my config to create zip files and I am using the Windows binary release of gallery-dl.

"postprocessors":
[
	{
		"name": "zip",
		"compression": "store",
                "keep-files": false,
		"whitelist": ["exhentai"]
	}
]

I find a gallery and run the command, gallery-dl <gallery-url>, which produces <gallery>.zip. I allow the command to execute. Looking at the zip file's NameToInfo in a python REPL shows me everything is okay:

{
	'p0001.png': <ZipInfo filename='p0001.png' filemode='-rw-rw-rw-' file_size=1314184>, 
	'p0007.png': <ZipInfo filename='p0007.png' filemode='-rw-rw-rw-' file_size=1185571>, 
	'p0008.png': <ZipInfo filename='p0008.png' filemode='-rw-rw-rw-' file_size=1401698>, 
	'p0003.png': <ZipInfo filename='p0003.png' filemode='-rw-rw-rw-' file_size=1640022>, 
	'p0006.png': <ZipInfo filename='p0006.png' filemode='-rw-rw-rw-' file_size=1406659>, 
	'p0005.png': <ZipInfo filename='p0005.png' filemode='-rw-rw-rw-' file_size=871243>, 
	'p0004.png': <ZipInfo filename='p0004.png' filemode='-rw-rw-rw-' file_size=1261454>, 
	'p0002.png': <ZipInfo filename='p0002.png' filemode='-rw-rw-rw-' file_size=1016398>
}

Next, I delete the output from my previous run and repeat the command. However, I interrupt partway with a C-c keyboard interrupt (page 6 completes, but pages 7 and 8 are absent as expected). The zip file now has no filemode/permissions! Furthermore, the file's NameToInfo is empty at {}! A comic viewer can see everything just fine, but that is odd! I suppose something magical happens in your finalize() function in postprocessor/zip.py, but I don't know very much python so this is a mystery to me. All I know is when I do what you do in your run() function to append a file, everything works just fine; furthermore, if I make a zip file from scratch, then NameToInfo will show all its contents with appropriate filemode/attributes:

zfile = zipfile.ZipFile("comic.zip", "a", zipfile.ZIP_STORED, True)
zfile.write("p0007.png) # this will have rw filemode, unlike pages 1-6
zfile.NameToInfo        # Only contains { 'p0007.png': <...> } and nothing else
zfile.close()

So, to reproduce this:

gallery-dl <gallery-url>
C-c partway through
gallery-dl <gallery-url>

The result is not the NameToInfo structure above, but only the subset containing the newer files for pages 7 and 8. Furthermore, comic viewers will only show/display the new files because their filemode is set but pages 1 through 6 have no filemode.

I've attached an image as a visual summary of what happened to me. In practice, substitute C-c for a loss of power, an application crash or something like that.

I suppose in practice this would not normally be a real issue. If someone is scraping with gallery-dl then they would probably be interrupted via their config (in which case zfile.close() occurs and this issue does not apply to them), but if they did suffer some interruption otherwise, the fix is to simply manually fix the zip contents' permissions.

However, this manual intervention need could be precluded by opening and closing the zip file on every single instance that zfile.run() is called, which might seem bananas but considering the lion's share of each iteration is not spent calculating file opens and file closes, doing so would make everything much more idiot-proof and allow us to assume no manual intervention is ever required on interruptions. At least, that how things seem to me.

An example of a fix, I suppose, would be the following change in postprocessor/zip.py, or to do something like with ZipFile('spam.zip', 'a') as zfile: ... ![gallery-dl zip bug demo](https://user-images.githubusercontent.com/4463623/61985424-0b50c300-afbe-11e9-9593-0fb3656f8ae0.png) or whatever, python is really something I'm rather awful at... mode 'a' seems to work even if file doesn't exist, not that I know what I'm talking about):

    def __init__(self, pathfmt, options):
        PostProcessor.__init__(self)
        self.delete = not options.get("keep-files", False)
        self.ext = "." + options.get("extension", "zip")
        algorithm = options.get("compression", "store")
        if algorithm not in self.COMPRESSION_ALGORITHMS:
            self.log.warning(
                "unknown compression algorithm '%s'; falling back to 'store'",
                algorithm)
            algorithm = "store"

        self.path = pathfmt.realdirectory
        # DELETE THIS:
        self.zfile = zipfile.ZipFile(
            self.path + self.ext, "a",
            self.COMPRESSION_ALGORITHMS[algorithm], True)

    def run(self, pathfmt):
        # MOVE FROM ABOVE:
        #self.zfile = zipfile.ZipFile(
        #    self.path + self.ext, "a",
        #    self.COMPRESSION_ALGORITHMS[algorithm], True)

        # 'NameToInfo' is not officially documented, but it's available
        # for all supported Python versions and using it directly is a lot
        # better than calling getinfo()
        if pathfmt.filename not in self.zfile.NameToInfo:
            self.zfile.write(pathfmt.temppath, pathfmt.filename)
            pathfmt.delete = self.delete

        # MOVE FROM BELOW:
        self.zfile.close()

    def finalize(self):
        # DELETE THIS:
        #self.zfile.close()

        if self.delete:
            try:
                os.rmdir(self.path)
            except OSError:
                pass

            if not self.zfile.NameToInfo:
                try:
                    os.unlink(self.zfile.filename)
                except OSError:
                    pass

The text was updated successfully, but these errors were encountered:

mikf · 2019-07-27T10:01:53Z

First of all, thank you for all the work you've already put into this. Having detailed error descriptions is always nice.

The crux of the issue appears to be that the zip file index (?) only get completely written when ZipFile.close() is called and is left in an "intermediate" state between having written a file to it and actually closing it. 20f7b07 now makes sure that ZipFile.close() always get called during exceptions, KeyboardInterrupt included, but there is no way to have this happen when the Python interpreter itself is killed or a power outage happens.

Of course your with ZipFile('spam.zip', 'a') as zfile: ... suggestion would drastically reduce the chance of the program unexpectedly dying while the zip file is in an "intermediate" state, but I generally dislike opening and parsing the zip archive over and over again when doing this once is more than enough - except in this case.

There is also the possibility that the process is interrupted during a ZipFile.write() and a file is only partially written. Detecting and fixing such a case also needs to happen somehow ...

mikf added a commit that referenced this issue Jul 27, 2019

ensure postproc finalize() is called during C-c or crash (#355)

20f7b07

mikf added a commit that referenced this issue Jul 29, 2019

[postprocessor:zip] add 'mode' option (#355)

a90280f

mikf added the partially-done label Dec 20, 2019

mikf closed this as completed Dec 18, 2020

mikf added feature-request and removed partially-done labels Dec 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using zip postprocessor creates a broken zip file if interrupted and resumed #355

Using zip postprocessor creates a broken zip file if interrupted and resumed #355

valconius commented Jul 26, 2019 •

edited

Loading

mikf commented Jul 27, 2019

Using zip postprocessor creates a broken zip file if interrupted and resumed #355

Using zip postprocessor creates a broken zip file if interrupted and resumed #355

Comments

valconius commented Jul 26, 2019 • edited Loading

mikf commented Jul 27, 2019

valconius commented Jul 26, 2019 •

edited

Loading