-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gzip.compress(..., mtime=0) in cpython 3.11+ unexpectedly sets OS byte in gzip header #112346
Comments
In itself I think it may be a good thing that the The problem is just that the change is not explicitly documented, as far as I know. |
to reproduce python/cpython#112346
Hi @dennisvang . This is my fault, as I delegated gzip.compress(mtime=0) to zlib.compress, incorrectly assuming this was the same. The reason is that zlib.compress is faster. But if it leads to behavioral changes, that is not acceptable. I believe this can easily be remedied by removing the codepath. |
I have made a PR. Just now and put Bugfix in the name. Now I hope it will get attention. |
@rhpvorderman Thanks for picking this up. I wonder, if this is the only side-effect, and if the performance gain from using |
Well, as mentioned in the PR, keeping two separate code paths caused issues before. It is best to keep one codepath. There is a mention in the documentation about zlib.compress so users who need the performance can use it themselves. |
@rhpvorderman You're right, that makes sense. |
ping |
For reference, this feature was added in bpo-43613 (gh-87779). It included more optimizations, the only issue with delegating the whole compression to zlib, when mtime is 0. The fix looks correct and it still preserves some speed up. An alternate solution could be to call |
Is this really a big deal? We won't be able to backport this to 3.11 as that is in security fix only mode.
zlib cannot be presumed to produce canonical output. There are many different zlib implementations. Decompressing gzipped data that you did not produce and recompressing it without using identical software is already not guaranteed to produce the same compressed output. From a reproducible build perspective I suggest always patching irrelevant fields such as gzip header mtime and OS fields to constant values as part of the build. |
@gpshead Probably not for most people. As commented above, I was only hoping for a small note to be added to the changelog (or documentation). Just in case someone does rely on the OS byte in the gzip header, in whatever context.
I just mentioned the reproducible build example to provide some context, although it was an edge case involving files created on the same machine, with exact same zlib implementation, but a different python version. |
This may be not such big deal, but it is still a problem. Since
The current behavior is not included in reasonable options. There are also two implementation options:
This should be decided based on relative timing of these two methods. For now, #114116 looks like a simple and safe option, but you are welcome to bikeshedding. |
The later is definitely quicker as the crc calculation also happens in one go. Using zlib.compress with wbits 31 and then always patching the header for more consistent results and should be a faster default path. |
I did some benchmarking and special-casing mtime=0 does not provide much benefit:
The problem is that I did not separately benchmark this codepath at the time, as it seemed to me that doing everything in C is obviously faster than using struct.pack in combination with building new bytes objects in memory. However, DEFLATE compression is apparently so expensive, even on level 1 that this does not matter. I also made a new PR. That makes zlib always write the trailer and the header is simply replacing parts of the zlib header. The speed is the same, but it simplifies the code a lot, and always guarantees the OS byte being set to 255. There is no separate codepath for mtime=0. The xfl byte is now set by zlib, as it ideally should be. The resulting code should be easier to maintain going forward. |
This is what I'd call our ideal behavior. Where we can't do that, documenting that the field may be set to different values in different situations is worthwhile which my draft docs PR does. I like the look of your #120486 change. We can back-port that to 3.13 as it is fine to make such a change during the beta period. |
…GH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent.
…ction. (pythonGH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent. (cherry picked from commit 08d09cf) Co-authored-by: Ruben Vorderman <[email protected]>
…nction. (GH-120486) (#120563) gh-112346: Always set OS byte to 255, simpler gzip.compress function. (GH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent. (cherry picked from commit 08d09cf) Co-authored-by: Ruben Vorderman <[email protected]>
…e in 3.11 (pythonGH-120480) (cherry picked from commit bac4eda) Co-authored-by: Gregory P. Smith <[email protected]> pythongh-112346: Describe the "os" byte in gzip output change.
…e in 3.11 (pythonGH-120480) (cherry picked from commit bac4eda) Co-authored-by: Gregory P. Smith <[email protected]> pythongh-112346: Describe the "os" byte in gzip output change.
…e in 3.11 (pythonGH-120480) (cherry picked from commit bac4eda) Co-authored-by: Gregory P. Smith <[email protected]> pythongh-112346: Describe the "os" byte in gzip output change.
…ction. (pythonGH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent.
…e in 3.11 (python#120480) pythongh-112346: Describe the "os" byte in gzip output change.
…ction. (pythonGH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent.
…e in 3.11 (python#120480) pythongh-112346: Describe the "os" byte in gzip output change.
…ction. (pythonGH-120486) This matches the output behavior in 3.10 and earlier; the optimization in 3.11 allowed the zlib library's "os" value to be filled in instead in the circumstance when mtime was 0. this keeps things consistent.
…e in 3.11 (python#120480) pythongh-112346: Describe the "os" byte in gzip output change.
Bug report
description
Using
gzip.compress()
withmtime=0
in 3.8<=cpython<=3.10, theOS
byte, i.e. the 10th byte in the GZIP header, is set to255
"unknown" (also see e.g. #83302):cpython/Lib/gzip.py
Line 599 in dc0adb4
However, in cpython 3.11 and 3.12, the
OS
byte is suddenly set to a "known" value, e.g.3
("Unix") on Ubuntu.This is not mentioned in the changelog for Python 3.11.
This may lead to problems in the context of reproducible builds. In our case, hash checking fails after decompressing and re-compressing a gzipped archive.
how to reproduce
Here's an example, where byte 10 is
\xff
in python 3.10 and\x03
in python 3.11:cause
I guess this is caused by python 3.11 delegating the
gzip.compress()
call tozlib
ifmtime=0
, as mentioned in the docs:and source:
cpython/Lib/gzip.py
Lines 609 to 612 in 89ddea4
Apparently
zlib
does set theOS
byte.CPython versions tested on:
3.8, 3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, macOS, Windows
Linked PRs
gzip.compress
output change in 3.11 #120480gzip.compress
output change in 3.11 (GH-120480) #120612gzip.compress
output change in 3.11 (GH-120480) #120613gzip.compress
output change in 3.11 (GH-120480) #120614The text was updated successfully, but these errors were encountered: