Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Zip file tokenization #471

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

zekroTJA
Copy link

Hey!

We've discovered some issues with the detection of some Microsoft PowerPoint files which seemingly contain nested elements like found in Excel project files.

The current implementation of the zipTokenizer looks for the PK\003\004 signature in the upcoming byte slice. This is problematic if the Zip file contains such elements inside the file contents of an entry itself, like when a Zip file contains another Zip file. Also, this could be problematic if contents of the extra fields contains this signature.

Below, you can find an example of an actual PPTX file which is falsely detected as XLS file.
Screenshot from 2024-01-24 10-53-30

I've changes the next method of the zipTokenizer so that it scans the actual fields of the file headers and skips the contents and extra fields of each entry, which fixes the problem.

This could also be a fix for the following issues.

Feel free to let me know what you think of these changes.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f8b7d47) 94.96% compared to head (ba94dc5) 94.96%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #471   +/-   ##
=======================================
  Coverage   94.96%   94.96%           
=======================================
  Files           3        3           
  Lines         159      159           
=======================================
  Hits          151      151           
  Misses          6        6           
  Partials        2        2           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Owner

@gabriel-vasile gabriel-vasile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zekroTJA, thank you for looking into this complicated issue. I looked into this issue a few months ago, but I gave up because it was hairy issue and I didn't really have the time.
#230 and #120 are not related to this issue, but #400 definitely is.

I think it's ok to proceed with the change you're proposing but there are some things to take care of in the PR.

internal/magic/zip.go Outdated Show resolved Hide resolved
internal/magic/zip.go Outdated Show resolved Hide resolved
internal/magic/zip.go Outdated Show resolved Hide resolved
internal/magic/zip.go Outdated Show resolved Hide resolved
@zekroTJA
Copy link
Author

Hey, thank you very much for your review!

I've took a deeper look into the Zip specification and rewrote the implementation so that the mentioned issues should be resolved.

  • A method zipTokenizer#read has been added which safely checks the size of in before slicing it to the requested size. This fixes the out of bounds panic in the fuzzing test.

  • When the signature of the central directory of the end of a Zip file is encountered, the tokenizer will short circuit and exit.

  • next now checks for the file descriptor flag in the flags bit field. After that, it searches for the inofficial PK\u0007\u0008 signature, which may be included before the file descriptor field. If not found, it just recursively calls next again to look for the next occurrence of the file header signature. This is basically a fallback to the method of the previous implementation, because (as far as I understand) there is no deterministic way to find the end of compressed data because the uncompressed size field is set to 0 when the file descriptor flag is set.

I've tested the implementation against the default Zip archives, archives with file descriptors, recursive archives, recursive archives with file descriptors and with zip64 files (both forced with zip -6 and with actual big files requiring the zip64 format). Though this implementation does not properly handle zip64, it still works by going with the fallback of searching for the file header signature. Proper zip64 support could be added later on based on this implementation.

Feel free to let me know what you think of the implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants