Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tarfile not re-entrant for multi-threading #67837

Closed
sgnn7 mannequin opened this issue Mar 12, 2015 · 13 comments
Closed

tarfile not re-entrant for multi-threading #67837

sgnn7 mannequin opened this issue Mar 12, 2015 · 13 comments
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@sgnn7
Copy link
Mannequin

sgnn7 mannequin commented Mar 12, 2015

BPO 23649
Nosy @gustaebel, @bitdancer, @xdegaye, @websurfer5
Files
  • mutithreading_tarfile.patch: makdirs_patch
  • extract_from_packages.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-03-12.16:40:01.811>
    labels = ['type-bug']
    title = 'tarfile not re-entrant for multi-threading'
    updated_at = <Date 2019-06-10.14:55:43.587>
    user = 'https://bugs.python.org/sgnn7'

    bugs.python.org fields:

    activity = <Date 2019-06-10.14:55:43.587>
    actor = 'vstinner'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = []
    creation = <Date 2015-03-12.16:40:01.811>
    creator = 'sgnn7'
    dependencies = []
    files = ['38462', '47523']
    hgrepos = []
    issue_num = 23649
    keywords = ['patch']
    message_count = 11.0
    messages = ['237960', '237961', '237964', '237968', '237970', '237986', '237989', '237990', '238181', '238183', '315067']
    nosy_count = 5.0
    nosy_names = ['lars.gustaebel', 'r.david.murray', 'xdegaye', 'sgnn7', 'Jeffrey.Kintscher']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue23649'
    versions = ['Python 3.4']

    Linked PRs

    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    When running tarfile.extract through multiple threads, the archive reading pointer is not protected from simultaneous seeks and causes various convoluted bugs:

    <some code>
    self.archive_object.extract(member, extraction_path)
    File "/usr/lib/python3.4/tarfile.py", line 2019, in extract
    set_attrs=set_attrs)
    File "/usr/lib/python3.4/tarfile.py", line 2088, in _extract_member
    self.makefile(tarinfo, targetpath)
    File "/usr/lib/python3.4/tarfile.py", line 2127, in makefile
    source.seek(tarinfo.offset_data)
    File "/usr/lib/python3.4/gzip.py", line 573, in seek
    self.read(1024)
    File "/usr/lib/python3.4/gzip.py", line 365, in read
    if not self._read(readsize):
    File "/usr/lib/python3.4/gzip.py", line 449, in _read
    self._read_eof()
    File "/usr/lib/python3.4/gzip.py", line 485, in _read_eof
    hex(self.crc)))
    OSError: CRC check failed 0x1036a2e1 != 0x0

    @sgnn7 sgnn7 mannequin added the type-bug An unexpected behavior, bug, or error label Mar 12, 2015
    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    Also, extract_member in tarfile.py is not thread-safe since the check for folder existence might occur during another thread's creation of that same dir causing the code to error out.

    File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
    File "./xdelta3-dir-patcher", line 499, in _apply_file_delta
    archive_object.expand(patch_file, staging_dir)
    File "./xdelta3-dir-patcher", line 284, in expand
    self.archive_object.extract(member, extraction_path)
    File "/usr/lib/python3.4/tarfile.py", line 2019, in extract
    set_attrs=set_attrs)
    File "/usr/lib/python3.4/tarfile.py", line 2080, in _extract_member
    os.makedirs(upperdirs)
    File "/usr/lib/python3.4/os.py", line 237, in makedirs
    mkdir(name, mode)
    FileExistsError: [Errno 17] File exists: '/tmp/XDelta3DirPatcher_is0y4_5f/xdelta/updated folder'

    Code causing problems:
    2065 def _extract_member(self, tarinfo, targetpath, set_attrs=True):
    ...
    2075 # Create all upper directories.
    2076 upperdirs = os.path.dirname(targetpath)
    2077 if upperdirs and not os.path.exists(upperdirs):
    ...
    2080 os.makedirs(upperdirs) # Fails since the dir might be already created between lines 2077 and 2080

    @sgnn7 sgnn7 mannequin added type-feature A feature request or enhancement type-bug An unexpected behavior, bug, or error and removed type-bug An unexpected behavior, bug, or error type-feature A feature request or enhancement labels Mar 12, 2015
    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    The code around tarfile multi-threading was fixed for me on the user-side with threading.Lock() usage so it might work to use this within the library and the directory creation could be improved by probably doing a try/except around the makedirs() call with ignoring of the exception if it's FileExistsError - my code I use elsewhere fixes this with:
    def _safe_makedirs(self, dir_path):
    try:
    makedirs(dir_path)
    # Concurrency problems need to be handled. If two threads create
    # the same dir, there might be a race between them checking and
    # doing makedirs so we handle that as gracefully as possible here.
    except FileExistsError as fee:
    if not os.path.isdir(dir_path):
    raise fee

    If I get time, I'll submit a patch but it seems like I probably won't for this.

    @bitdancer
    Copy link
    Member

    If you want to use an object that has state in more than one thread you generally have to put some locking around it. Unless I'm missing something (which I might be) I don't think it is tarfile's responsibility to do this.

    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    I don't know if that's true of core libraries. Why complicate things for end users when those issues could be done in the library itself and be completely transparent to the devs? A simple RLock latch wouldn't pose almost any speed degradation but would work in both threaded and non-threaded situations as expected.

    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    After some thinking, for the makedirs it should only need makedirs(exist_ok=True)

    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    Patch for the multithreaded expansion of files and use of makedirs.

    @sgnn7
    Copy link
    Mannequin Author

    sgnn7 mannequin commented Mar 12, 2015

    The whole lib still needs the threading locks added but the patch submitted should fix things for people that do the locking from their code.

    @gustaebel
    Copy link
    Mannequin

    gustaebel mannequin commented Mar 16, 2015

    I agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway. So, it seems best to me if we leave it like it is and let the user add locks around it as she/he sees fit.

    @vstinner
    Copy link
    Member

    Lars Gustäbel added the comment:

    I agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway. So, it seems best to me if we leave it like it is and let the user add locks around it as she/he sees fit.

    In asyncio, it was a design choice to not be thread-safe, to allow
    more optimizations and support multiple implementations of asyncio,
    without this important constraint.

    I modified recently the asyncio doc to warn users in each class that
    asyncio objects are *not* thread safe, with an explanation how to use
    correctly asyncio with threads.

    https://docs.python.org/dev/library/asyncio-eventloop.html#asyncio.BaseEventLoop
    "This class is not thread safe."

    Such change in tarfile doc is probably enough for tarfile.

    @xdegaye
    Copy link
    Mannequin

    xdegaye mannequin commented Apr 7, 2018

    extract_from_pkgs() in the attached extract_from_packages.py script extracts /etc files from the tar files in PKG_DIR into WORK_DIR using a ThreadPoolExecutor (a ThreadPoolExecutor, when used to extract all the /etc files from the packages that build a whole ArchLinux system, divides the elapsed time by 2). Running this script that tests this function fails randomly with the same error as reported by Srdjan in msg237961.

    Replacing ThreadPoolExecutor with ProcessPoolExecutor also fails randomly.

    Using the safe_makedirs() context manager to enclose the statements than run ThreadPoolExecutor fixes the problem.

    Obviously this in not a problem related to thread-safety (it occurs also with ProcessPoolExecutor) but a problem about the robustness of the tarfile module in a concurrent access context. The problem is insidious in that it may never occur in an application test suite.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @piskvorky
    Copy link

    piskvorky commented Mar 4, 2023

    I was also bitten by this. I process large nested archives (TARs within TARs with ZIPs etc) in read-only mode in parallel, and noticed strange inconsistencies in the number of extracted files. No exceptions – just for tar_info in tar_file: … iteration sometimes ended prematurely. I tracked this problem down to tarfile and multithreading.

    I have to disagree with @gustaebel 's conclusion:

    agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway.

    There is much to be gained by threading (at least in my application), and the resource does not have to be accessed sequentially. tarfile already seeks around, each object/method knows its offset.

    Like @sgnn7, I fixed the issue by adding locks around TarFile's open() & next() & extractfile() calls in my code, that did it. But not having to worry about this would be great. Actually noticing this issue in the first place was the biggest challenge – there were no exceptions! Just mysteriously incomplete data.

    Unlike @sgnn7, I didn't get any exception – without the locks, tarfile.next() silently swallows a tarfile.InvalidHeaderError: invalid header exception (caused by some internal seek/tell offset corruption due to multithreading), which means next() returned tarinfo = None, which means it ended the TAR iteration early. No errors, just missing data.

    "This class is not thread safe."
    Such change in tarfile doc is probably enough for tarfile.

    Fair. Currently the word "thread" does not appear on https://docs.python.org/3/library/tarfile.html.

    As an aside, not having to call tar_file.members = [] regularly during iteration (to avoid OOM due to TarFile's internal TarInfo caching) would be great too. But that's not related to threading, and easier to spot and fix.

    EDIT: for inspiration, I was checking how zipfile handles threading, and found analogous threading woes and merged fix there. zipfile uses locks around its seek/tell/offset to achieve thread safety.

    @shughes-uk
    Copy link
    Contributor

    shughes-uk commented Apr 5, 2023

    We ran into the mkdirs race condition.

    Our use case is extracting a list of tar files concurrently. Each tar file has files in it that may have overlapping parent paths with other tar files. We are actually using multiprocessing rather than threading in this instance.

    @arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Apr 5, 2023
    serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Feb 6, 2024
    …action
    
    Avoid race conditions in the creation of directories during concurrent
    extraction in tarfile and zipfile.
    
    Co-authored-by: Samantha Hughes <[email protected]>
    Co-authored-by: Peder Bergebakken Sundt <[email protected]>
    serhiy-storchaka added a commit that referenced this issue Feb 11, 2024
    …15082)
    
    Avoid race conditions in the creation of directories during concurrent
    extraction in tarfile and zipfile.
    
    Co-authored-by: Samantha Hughes <[email protected]>
    Co-authored-by: Peder Bergebakken Sundt <[email protected]>
    fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this issue Feb 14, 2024
    …action (pythonGH-115082)
    
    Avoid race conditions in the creation of directories during concurrent
    extraction in tarfile and zipfile.
    
    Co-authored-by: Samantha Hughes <[email protected]>
    Co-authored-by: Peder Bergebakken Sundt <[email protected]>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    Status: Done
    Development

    No branches or pull requests

    6 participants