tarfile not re-entrant for multi-threading #67837
Comments
When tarfile.extract is called from multiple threads, the archive's read pointer is not protected from simultaneous seeks, which causes various convoluted bugs: <some code> |
Also, extract_member in tarfile.py is not thread-safe: the check for folder existence can happen while another thread is creating that same directory, causing the code to error out. File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run Code causing problems: |
I fixed the tarfile multi-threading problem on the user side with threading.Lock(), so the same approach might work inside the library. The directory creation could probably be improved by wrapping the makedirs() call in a try/except that ignores the exception if it is FileExistsError. My code elsewhere fixes this with: If I get time, I'll submit a patch, but it seems I probably won't for this. |
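For readers following along, here is a minimal sketch of the user-side locking described above. It is not the code from the original comment (which was lost); the helper names `build_sample_archive` and `extract_all_locked` are hypothetical, and the sample archive is generated only so the example is self-contained.

```python
import io
import tarfile
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_sample_archive(path, count=20):
    """Create a small tar archive whose files share one subdirectory."""
    with tarfile.open(path, "w") as tar:
        for i in range(count):
            data = f"payload {i}\n".encode()
            info = tarfile.TarInfo(name=f"etc/app/file{i}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def extract_all_locked(archive_path, dest, workers=4):
    """Extract every member from worker threads, serialising TarFile access."""
    lock = threading.Lock()  # user-side lock (assumption: not part of tarfile)
    # Pass an extraction filter only where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path) as tar:
        members = tar.getmembers()

        def extract_one(member):
            # Only one thread at a time may seek and read the shared file object.
            with lock:
                tar.extract(member, path=dest, **kwargs)
            return member.name

        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(extract_one, members))
```

Because the lock covers the whole `extract()` call, the extraction itself is effectively serial; threads only help if the surrounding work (for example post-processing each file) runs outside the lock.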
If you want to use an object that has state in more than one thread you generally have to put some locking around it. Unless I'm missing something (which I might be) I don't think it is tarfile's responsibility to do this. |
I don't know if that's true of core libraries. Why complicate things for end users when those issues could be handled in the library itself and be completely transparent to the devs? A simple RLock latch would cause almost no speed degradation and would work as expected in both threaded and non-threaded situations. |
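To make the RLock idea concrete, here is a rough sketch of what such a latch could look like as a wrapper around a TarFile. `LockedTarReader` and `make_archive_bytes` are hypothetical names invented for this illustration; nothing like this exists in the tarfile module, and the sketch only covers reading members into memory.

```python
import io
import tarfile
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedTarReader:
    """Hypothetical wrapper: every access to the TarFile holds one RLock."""

    def __init__(self, fileobj):
        self._lock = threading.RLock()
        self._tar = tarfile.open(fileobj=fileobj, mode="r")

    def names(self):
        with self._lock:
            return self._tar.getnames()

    def read_member(self, name):
        # The seek and the read both happen under the lock,
        # so two threads cannot interleave them.
        with self._lock:
            handle = self._tar.extractfile(name)
            return handle.read() if handle is not None else None

    def close(self):
        with self._lock:
            self._tar.close()


def make_archive_bytes(count=10):
    """Build an in-memory archive so the example needs no files on disk."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for i in range(count):
            data = (f"member {i}\n" * 100).encode()
            info = tarfile.TarInfo(name=f"data/{i}.txt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer
```

A caller could then hand `reader.read_member` to a `ThreadPoolExecutor` without adding any locking of its own. An RLock rather than a plain Lock is used so that one locked method can call another on the same thread without deadlocking.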
After some thought, the makedirs fix should only need makedirs(exist_ok=True).
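As an illustration of the two directory-creation approaches mentioned in this thread, the sketch below runs each from several threads at once. The function names are hypothetical and this is not the submitted patch, only a small demonstration of `os.makedirs` behaviour.

```python
import os
import tempfile
import threading


def ensure_dir_try_except(path):
    """First suggestion: ignore the error if someone else created the directory."""
    try:
        os.makedirs(path)
    except FileExistsError:
        pass


def ensure_dir_exist_ok(path):
    """Second suggestion: let makedirs tolerate an existing directory."""
    os.makedirs(path, exist_ok=True)


def create_concurrently(ensure_dir, target, workers=8):
    """Call ensure_dir(target) from several threads and collect any errors."""
    errors = []

    def worker():
        try:
            ensure_dir(target)
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

One difference worth noting: the try/except version also swallows FileExistsError when the path exists as a regular file, whereas `exist_ok=True` still raises in that case, so it is the stricter of the two.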
Patch for the multithreaded expansion of files and use of makedirs. |
The whole lib still needs the threading locks added but the patch submitted should fix things for people that do the locking from their code. |
I agree with David that there is no need for tarfile to be thread-safe. There is nothing to be gained from distributing one TarFile object among multiple threads because it operates on a single resource which has to be accessed sequentially anyway. So, it seems best to me if we leave it like it is and let the user add locks around it as she/he sees fit. |
Lars Gustäbel added the comment:
In asyncio, it was a design choice to not be thread-safe, to allow I recently modified the asyncio doc to warn users in each class that https://docs.python.org/dev/library/asyncio-eventloop.html#asyncio.BaseEventLoop Such a change in the tarfile doc is probably enough for tarfile. |
extract_from_pkgs() in the attached extract_from_packages.py script extracts /etc files from the tar files in PKG_DIR into WORK_DIR using a ThreadPoolExecutor (when used to extract all the /etc files from the packages that make up a whole ArchLinux system, a ThreadPoolExecutor divides the elapsed time by 2). Running the script that tests this function fails randomly with the same error as reported by Srdjan in msg237961. Replacing ThreadPoolExecutor with ProcessPoolExecutor also fails randomly. Using the safe_makedirs() context manager to enclose the statements that run ThreadPoolExecutor fixes the problem. Obviously this is not a thread-safety problem (it also occurs with ProcessPoolExecutor) but a problem with the robustness of the tarfile module under concurrent access. The problem is insidious in that it may never occur in an application's test suite. |
I was also bitten by this. I process large nested archives (TARs within TARs with ZIPs etc) in read-only mode in parallel, and noticed strange inconsistencies in the number of extracted files. No exceptions – just I have to disagree with @gustaebel 's conclusion:
There is much to be gained by threading (at least in my application), and the resource does not have to be accessed sequentially. Like @sgnn7, I fixed the issue by adding locks around TarFile's Unlike @sgnn7, I didn't get any exception – without the locks,
Fair. Currently the word "thread" does not appear on https://docs.python.org/3/library/tarfile.html. As an aside, not having to call EDIT: for inspiration, I was checking how |
We ran into the Our use case is extracting a list of |
…action Avoid race conditions in the creation of directories during concurrent extraction in tarfile and zipfile. Co-authored-by: Samantha Hughes <[email protected]> Co-authored-by: Peder Bergebakken Sundt <[email protected]>
…action (pythonGH-115082) Avoid race conditions in the creation of directories during concurrent extraction in tarfile and zipfile. Co-authored-by: Samantha Hughes <[email protected]> Co-authored-by: Peder Bergebakken Sundt <[email protected]>