Fix Zip file tokenization #471
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage Diff
             master    #471    +/-
Coverage     94.96%    94.96%
Files        3         3
Lines        159       159
Hits         151       151
Misses       6         6
Partials     2         2
Hi @zekroTJA, thank you for looking into this complicated issue. I looked into it a few months ago, but I gave up because it was a hairy issue and I didn't really have the time.
#230 and #120 are not related to this issue, but #400 definitely is.
I think it's ok to proceed with the change you're proposing but there are some things to take care of in the PR.
Hey, thank you very much for your review! I've taken a deeper look into the Zip specification and rewritten the implementation so that the mentioned issues should be resolved.
I've tested the implementation against default Zip archives, archives with data descriptors, recursive archives, recursive archives with data descriptors, and zip64 files (both forced with …).
Feel free to let me know what you think of the implementation.
Hey!
We've discovered some issues with the detection of some Microsoft PowerPoint files which seemingly contain nested elements like those found in Excel project files.
The current implementation of the zipTokenizer looks for the PK\003\004 signature in the upcoming byte slice. This is problematic if the Zip file contains that signature inside the contents of an entry itself, for example when a Zip file contains another Zip file. It could also be problematic if the contents of the extra fields contain this signature.
Below, you can find an example of an actual PPTX file that is falsely detected as an XLS file.
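Independently of that file, the failure mode can be reproduced with a handful of bytes. The following is a minimal sketch, not the library's code: countSignatures and the byte layout are hypothetical, and the input is deliberately not a valid archive. It only shows that a plain search for the signature cannot tell a real header from payload bytes that happen to contain the same four bytes.

```go
package main

import (
	"bytes"
	"fmt"
)

// localFileHeaderSig is the 4-byte signature that starts every Zip local
// file header ("PK\003\004" in octal notation).
var localFileHeaderSig = []byte("PK\x03\x04")

// countSignatures is a hypothetical helper that mimics a signature-only scan:
// it counts every occurrence of the signature, whether or not the match sits
// at a real header boundary.
func countSignatures(raw []byte) int {
	return bytes.Count(raw, localFileHeaderSig)
}

func main() {
	// Assumed layout for illustration only: one outer header marker followed
	// by payload bytes that embed a second marker, as a stored nested Zip would.
	outer := []byte("PK\x03\x04<outer header fields>")
	outer = append(outer, []byte("PK\x03\x04<inner entry>")...)

	// The scan sees two "entries" although only one belongs to the outer archive.
	fmt.Println(countSignatures(outer)) // prints 2
}
```

A tokenizer built on such a scan would hand the inner entry's name to the detection logic as if it were a top-level entry, which is consistent with the misdetection described above.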
I've changed the next method of the zipTokenizer so that it scans the actual fields of the file headers and skips the contents and extra fields of each entry, which fixes the problem.
This could also be a fix for the following issues.
Feel free to let me know what you think of these changes.
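The header-walking idea described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: entryNames and storedEntry are hypothetical names, the entries are stored uncompressed, and the sketch ignores data descriptors (general purpose flag bit 3) and zip64 sizes, both of which a real implementation has to handle. The field offsets follow the local file header layout from the Zip specification: compressed size at byte 18, file name length at byte 26, extra field length at byte 28, with a fixed part of 30 bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	localHeaderSig = 0x04034b50 // "PK\x03\x04" read as a little-endian uint32
	fixedHeaderLen = 30         // size of the fixed part of a local file header
)

// storedEntry (hypothetical test helper) builds a local file header followed
// by an uncompressed payload. CRC, timestamps, and flags stay zero because
// the walker below never reads them.
func storedEntry(name string, payload []byte) []byte {
	h := make([]byte, fixedHeaderLen)
	binary.LittleEndian.PutUint32(h[0:], localHeaderSig)
	binary.LittleEndian.PutUint32(h[18:], uint32(len(payload))) // compressed size
	binary.LittleEndian.PutUint32(h[22:], uint32(len(payload))) // uncompressed size
	binary.LittleEndian.PutUint16(h[26:], uint16(len(name)))    // file name length
	// The extra field length at h[28:] stays 0.
	h = append(h, name...)
	return append(h, payload...)
}

// entryNames (hypothetical) walks the local file headers one by one. It reads
// the lengths from each header and jumps over the name, the extra field, and
// the entry contents, so signatures inside those regions are never inspected.
func entryNames(raw []byte) []string {
	var names []string
	for len(raw) >= fixedHeaderLen && binary.LittleEndian.Uint32(raw) == localHeaderSig {
		compSize := int(binary.LittleEndian.Uint32(raw[18:]))
		nameLen := int(binary.LittleEndian.Uint16(raw[26:]))
		extraLen := int(binary.LittleEndian.Uint16(raw[28:]))
		next := fixedHeaderLen + nameLen + extraLen + compSize
		if next > len(raw) {
			break // truncated input: stop instead of reading out of bounds
		}
		names = append(names, string(raw[fixedHeaderLen:fixedHeaderLen+nameLen]))
		raw = raw[next:]
	}
	return names
}

func main() {
	// A nested archive stored inside the outer one, followed by a regular entry.
	inner := storedEntry("xl/workbook.xml", []byte("inner"))
	outer := append(storedEntry("nested.zip", inner),
		storedEntry("ppt/presentation.xml", []byte("slides"))...)

	// Only the two top-level names are reported; the nested name is skipped.
	fmt.Println(entryNames(outer)) // prints [nested.zip ppt/presentation.xml]
}
```

Because the walker trusts the size fields, entries whose sizes are only known from a trailing data descriptor need separate handling, which is why those cases were tested explicitly in the rewritten implementation.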