Fix Zip file tokenization #471

zekroTJA · 2024-01-25T13:00:14Z

Hey!

We've discovered some issues with the detection of some Microsoft PowerPoint files which seemingly contain nested elements like found in Excel project files.

The current implementation of the zipTokenizer looks for the PK\003\004 signature in the upcoming byte slice. This is problematic if the Zip file contains such elements inside the file contents of an entry itself, like when a Zip file contains another Zip file. Also, this could be problematic if contents of the extra fields contains this signature.

Below, you can find an example of an actual PPTX file which is falsely detected as XLS file.

I've changes the next method of the zipTokenizer so that it scans the actual fields of the file headers and skips the contents and extra fields of each entry, which fixes the problem.

This could also be a fix for the following issues.

Feel free to let me know what you think of these changes.

codecov-commenter · 2024-01-26T16:19:22Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f8b7d47) 94.96% compared to head (ba94dc5) 94.96%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #471   +/-   ##
=======================================
  Coverage   94.96%   94.96%           
=======================================
  Files           3        3           
  Lines         159      159           
=======================================
  Hits          151      151           
  Misses          6        6           
  Partials        2        2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

gabriel-vasile

Hi @zekroTJA, thank you for looking into this complicated issue. I looked into this issue a few months ago, but I gave up because it was hairy issue and I didn't really have the time.
#230 and #120 are not related to this issue, but #400 definitely is.

I think it's ok to proceed with the change you're proposing but there are some things to take care of in the PR.

internal/magic/zip.go

zekroTJA · 2024-01-29T10:56:51Z

Hey, thank you very much for your review!

I've took a deeper look into the Zip specification and rewrote the implementation so that the mentioned issues should be resolved.

A method zipTokenizer#read has been added which safely checks the size of in before slicing it to the requested size. This fixes the out of bounds panic in the fuzzing test.
When the signature of the central directory of the end of a Zip file is encountered, the tokenizer will short circuit and exit.
next now checks for the file descriptor flag in the flags bit field. After that, it searches for the inofficial PK\u0007\u0008 signature, which may be included before the file descriptor field. If not found, it just recursively calls next again to look for the next occurrence of the file header signature. This is basically a fallback to the method of the previous implementation, because (as far as I understand) there is no deterministic way to find the end of compressed data because the uncompressed size field is set to 0 when the file descriptor flag is set.

I've tested the implementation against the default Zip archives, archives with file descriptors, recursive archives, recursive archives with file descriptors and with zip64 files (both forced with zip -6 and with actual big files requiring the zip64 format). Though this implementation does not properly handle zip64, it still works by going with the fallback of searching for the file header signature. Proper zip64 support could be added later on based on this implementation.

Feel free to let me know what you think of the implementation.

zekroTJA added 2 commits January 25, 2024 13:46

fix zipTokenizer

0006b69

add some documentation

ba94dc5

gabriel-vasile requested changes Jan 26, 2024

View reviewed changes

internal/magic/zip.go Outdated Show resolved Hide resolved

internal/magic/zip.go Outdated Show resolved Hide resolved

internal/magic/zip.go Outdated Show resolved Hide resolved

internal/magic/zip.go Outdated Show resolved Hide resolved

rewrite zipTokenizer

54adabc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Zip file tokenization #471

Fix Zip file tokenization #471

zekroTJA commented Jan 25, 2024

codecov-commenter commented Jan 26, 2024

gabriel-vasile left a comment

zekroTJA commented Jan 29, 2024

Fix Zip file tokenization #471

Are you sure you want to change the base?

Fix Zip file tokenization #471

Conversation

zekroTJA commented Jan 25, 2024

codecov-commenter commented Jan 26, 2024

Codecov Report

gabriel-vasile left a comment

Choose a reason for hiding this comment

zekroTJA commented Jan 29, 2024