Reading ZipFile not thread-safe #86535
According to https://docs.python.org/3.5/whatsnew/changelog.html#id108 (bpo-14099), reading multiple ZipExtFiles should be thread-safe, but it is not. I created a small example where two threads try to read files from the same ZipFile simultaneously, which crashes with a Bad CRC-32 error. This is especially surprising since all files in the ZipFile contain only 0-bytes and have the same CRC.

My use case is a ZipFile with 82000 files. Creating multiple ZipFiles from the same "physical" zip file is not a satisfactory workaround because it takes several seconds each time. Instead, I open it only once and clone it for each thread:

```python
import zipfile

# Create an empty dummy zip once so that ZipFile() has something to parse.
with zipfile.ZipFile("/tmp/dummy.zip", "w") as dummy:
    pass

def clone_zipfile(z):
    # Build a fresh ZipFile from the dummy, then swap in the real metadata
    # and a new file handle so each thread gets its own file position.
    z_cloned = zipfile.ZipFile("/tmp/dummy.zip")
    z_cloned.NameToInfo = z.NameToInfo
    z_cloned.fp = open(z.fp.name, "rb")
    return z_cloned
```

This is a much better solution for my use case than locking: I am using multiple threads because I want to finish my task faster, and locking defeats that purpose. However, this cloning is somewhat of a dirty hack and will break when the file is not a real file but a file-like object. Unfortunately, I do not have a solution for the general case.
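For illustration, one way the clone might be used per thread (a sketch, not from the report; the archive path, member names, and thread count are made up):

```python
import threading
import zipfile

def worker(z, names):
    # Each thread reads through its own clone, so no file position is shared.
    z_local = clone_zipfile(z)
    for name in names:
        z_local.read(name)

z = zipfile.ZipFile("/tmp/archive.zip")  # hypothetical archive
threads = [threading.Thread(target=worker, args=(z, ["a.txt", "b.txt"]))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```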
I'm changing from "crash" to "behavior". We use "crash" for a segfault or equivalent. I realize that most people are unlikely to know this, but we consider "crash" to be more alarming, so I want to make sure it's correct. Also: when this happens, is it always for file 127, or does it change on each run?
I have not observed any segfaults yet, only zipfile.BadZipFile exceptions so far. The exact file at which it fails is fairly random. It even fails if all threads try to read the same file multiple times.

I think the root cause of the problem is that the reads of `zef_file` in `ZipFile.read` are not locked properly (line 1515 of zipfile.py at commit c79667f). The underlying file object is shared between all ZipExtFiles. Every time a thread calls `ZipFile.read`, a new lock is created in `_SharedFile`, but that lock only protects against multiple threads reading the same ZipExtFile. Multiple threads reading different ZipExtFiles with the same underlying file object will cause trouble. The locks do nothing in this scenario because they are individual to each thread and not shared.
Scratch what I said in the previous message. I thought that the lock was created in _SharedFile and did not notice that it was passed as a parameter.
I have simplified the test case a bit more:

```python
import multiprocessing.pool, zipfile

# Create a ZipFile with two files and same content
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_STORED) as z:
    z.writestr("file1", b"0" * 10000)
    z.writestr("file2", b"0" * 10000)

# Read file1 with two threads at once
with zipfile.ZipFile("test.zip", "r") as z:
    pool = multiprocessing.pool.ThreadPool(2)
    while True:
        pool.map(z.read, ["file1", "file1"])
```

Two files are sufficient to cause the error. It does not matter which files are read or which content they have.

I also narrowed down the point of failure a bit. After the `self._file.seek(self._pos)` in `_SharedFile.read` (line 742 of zipfile.py at commit c79667f), I added:

```python
assert self._file.tell() == self._pos
```

The issue occurs when seeking to position 35 (size of header + length of name). Most of the time, `self._file.tell()` will then be 35 as expected, but sometimes it is 8227 instead, i.e. 35 + 8192. I am not sure how this can happen since the file object should be locked.
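A rough, simplified illustration (not the real zipfile internals) of the kind of interleaving that could produce the 35 vs. 8227 mismatch: two readers share one raw file object and each remembers its own logical position, but `tell()` on the raw object reports wherever the other reader last left it.

```python
import io
import threading

raw = io.BytesIO(b"0" * 10000)   # stands in for the shared underlying file object
lock = threading.Lock()

class Reader:
    def __init__(self, pos):
        self._pos = pos

    def read(self, n):
        with lock:               # seek + read happen together under the lock...
            raw.seek(self._pos)
            data = raw.read(n)
            self._pos = raw.tell()
        return data

    def tell_shared(self):
        return raw.tell()        # ...but this reflects the last reader, not self._pos

a, b = Reader(35), Reader(35)
b.read(8192)
print(a.tell_shared())           # 8227 (= 35 + 8192), not 35
```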
I think I found the root cause of this problem and proposed a fix in #26974.

To monkey-patch this fix on existing versions of Python, I'm using:

```python
import zipfile

class PatchedSharedFile(zipfile._SharedFile):
    def __init__(self, *args):
        super().__init__(*args)
        # Report the tracked virtual position instead of the shared file's position.
        self.tell = lambda: self._pos

zipfile._SharedFile = PatchedSharedFile
```
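With the patch installed, the reproduction from further up can be re-run to check that the error is gone (a sketch assuming the test.zip from the earlier snippet exists; the iteration count is arbitrary):

```python
import multiprocessing.pool
import zipfile

# PatchedSharedFile from the snippet above must already be assigned to
# zipfile._SharedFile before the reads below.
with zipfile.ZipFile("test.zip", "r") as z:
    pool = multiprocessing.pool.ThreadPool(2)
    for _ in range(1000):        # bounded, unlike the original `while True`
        pool.map(z.read, ["file1", "file1"])
print("no BadZipFile raised")
```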
The monkey patch works for me, thank you very much! (I have only tested reading, not writing.) However, the lock contention of Python's ZipFile is so bad that using multiple threads actually makes the code run _slower_ than single-threaded code when reading a zip file with many small files. For this reason, I am no longer using ZipFile. Instead, I have implemented a subset of the zip spec without locks, which gives me a speedup of over 2500% for reading many small files compared to ZipFile. I think that the architecture of ZipFile should be reconsidered, but this exceeds the scope of this issue.
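For context, a minimal sketch of the lock-free idea (not the author's ParallelZipFile; it assumes uncompressed ZIP_STORED members, a real on-disk archive, and a POSIX `os.pread`): the central directory is parsed once, and every read uses `os.pread` at an absolute offset, so there is no shared file position and nothing to lock.

```python
import os
import struct
import zipfile

class LockFreeStoredReader:
    def __init__(self, path):
        # Parse the central directory once to get each member's offset and size.
        with zipfile.ZipFile(path) as z:
            self._infos = {i.filename: i for i in z.infolist()}
        self._fd = os.open(path, os.O_RDONLY)

    def read(self, name):
        info = self._infos[name]
        # The local file header is 30 bytes; the name and extra-field lengths
        # sit at offsets 26 and 28 of that header.
        header = os.pread(self._fd, 30, info.header_offset)
        name_len, extra_len = struct.unpack("<HH", header[26:30])
        data_offset = info.header_offset + 30 + name_len + extra_len
        # ZIP_STORED members are raw bytes, so no decompression is needed.
        return os.pread(self._fd, info.file_size, data_offset)

    def close(self):
        os.close(self._fd)
```

Since `os.pread` never touches the file descriptor's position, `read` can be called from many threads at once without any synchronization.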
Hi Thomas, I'm facing the same issue. Would you care to open-source your implementation?
@khaledk I finally got some time off, so here you go: https://github.com/99991/ParallelZipFile. I cannot offer any support for a more correct implementation of the zip specification due to time constraints, but maybe the code is useful for you anyway.
In addition to fixing any unexpected behavior, can we update the documentation [1] to state what the expected behavior is in terms of thread safety?
The `_SharedFile` tracks its own virtual position in the file as `self._pos` and updates it after reading or seeking. `tell()` should return this position instead of calling into the underlying file object: if multiple `_SharedFile` instances are used concurrently on the same file, another one may have moved the real file position. Additionally, calling into the underlying `tell` may expose thread-safety issues in the underlying file object, because it is called without taking the lock.

Prior to this fix, the test case in https://bugs.python.org/issue42369#msg381212 reliably caused a `zipfile.BadZipFile: Bad CRC-32 for file 'file1'` after a few dozen reads; with this fix I have not seen this error.

From-PR: gh#python/cpython!26974
Fixes: gh#python#86535
Patch: bh42369-thread-safety-zipfile-SharedFile.patch