Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Format File Sizes Human-Readable in the CLI #2702

Merged
merged 26 commits into from
Jun 10, 2021

Conversation

felixhandte
Copy link
Contributor

@felixhandte felixhandte commented Jun 9, 2021

This PR extends @scottchiefbaker's #2696. It switches zstd's CLI output to printing human-readable representations of file sizes, rather than full-precision integers.

This table shows how this PR formats various sizes in comparison to ls -lh. There are some differences, but in general I prefer this formatting over ls's, since this provides more consistent 3-4 digits of precision and rounds-to-nearest rather than always rounding-up.

Size zstd ls -lh
1 1 B 1
12 12 B 12
123 123 B 123
1234 1.21 KiB 1.3K
12345 12.1 KiB 13K
123456 121 KiB 121K
1234567 1.18 MiB 1.2M
12345678 11.8 MiB 12M
123456789 118 MiB 118M
1234567890 1.15 GiB 1.2G
999 999 B 999
1000 1000 B 1000
1001 1001 B 1001
1023 1023 B 1023
1024 1.000 KiB 1.0K
1025 1.00 KiB 1.1K
999999 977 KiB 977K
1000000 977 KiB 977K
1000001 977 KiB 977K
1023999 1000 KiB 1000K
1024000 1000 KiB 1000K
1024001 1000 KiB 1001K
1048575 1024 KiB 1.0M
1048576 1.000 MiB 1.0M
1048577 1.00 MiB 1.1M
Repro Instructions:
for N in 1 12 123 1234 12345 123456 1234567 12345678 123456789 1234567890 999 1000 1001 1023 1024 1025 999999 1000000 1000001 1023999 1024000 1024001 1048575 1048576 1048577; do
  head -c $N /dev/urandom > r$N
done
./zstd -i1 -b1 -S r1 r12 r123 r1234 r12345 r123456 r1234567 r12345678 r123456789 r1234567890 r999 r1000 r1001 r1023 r1024 r1025 r999999 r1000000 r1000001 r1023999 r1024000 r1024001 r1048575 r1048576 r1048577

programs/util.c Outdated Show resolved Hide resolved
@Cyan4973
Copy link
Contributor

Cyan4973 commented Jun 9, 2021

I like the proposed new formatting.
Ready to validate once tests pass.

@senhuang42
Copy link
Contributor

I think the human readable format is generally better so this is great!

My only concern is that sometimes the full-precision ints are useful in development for spotting very small changes in compressed size when using -b#e# or even in general. I could also just compress the file and check the compressed result to spot these changes, but this gets a bit annoying when I'm trying out multiple levels. But then again, this human-readable thing was meant to improve the experience in the first place, so I think it's a valid enough concern.

Is there any easy way we can make it toggle back to raw bytes with -v, depending on g_displayLevel or something?

@felixhandte
Copy link
Contributor Author

@senhuang42, good point. I was thinking about whether -v should expand these back to full precision but I couldn't think of a motivating use case. The one you provide is a good one that I missed.

The problem is that I'm not sure how I would best accomplish that. It wouldn't be enough to cancel the scaling and suffix. The value is currently stored as float, but even if it were a double, it would still lose the ability to carry full precision for files over 2^53 bytes in size. So to maintain full precision over the range of file sizes it would have to be transported as a uint64_t. But then it's unclear how to dispatch it in the format string...

You'd have to go back to preparing the value in a separate buffer. Which we could do, I guess. @Cyan4973, do you think that's worth doing?

@felixhandte
Copy link
Contributor Author

Actually I think it's not so hard to return to working with our own buffer. I'll play around with that.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jun 9, 2021

Yes, I believe that full-precision of size for benchmarking / measurement purposes is a good use case.

Also :

it would still lose the ability to carry full precision for files over 2^53 bytes in size

This represents 8 PB. It feels like an acceptable limitation.
Reducing maximum accuracy to KB level in this case also seems an acceptable mitigation.

@felixhandte
Copy link
Contributor Author

Ok I've made a few changes (refer to commit messages). Let me know if you think these make sense. I chose to bump full precision display to require double-verbose because that way single-verbose still gets you human-readable display of each compression when processing multiple files.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Jun 9, 2021

I chose to bump full precision display to require double-verbose because that way single-verbose still gets you human-readable display of each compression when processing multiple files.

I agree, that seems a good choice.

@aqrit
Copy link
Contributor

aqrit commented Jun 10, 2021

bike shedding: megabyte MB or mebibyte MiB

scottchiefbaker and others added 19 commits June 10, 2021 12:53
This produces the following formatting:

   Size    | `zstd` | `ls -lh`
---------- | ------ | --------
1          | 1      | 1
12         | 12     | 12
123        | 123    | 123
1234       | 1.21K  | 1.3K
12345      | 12.1K  | 13K
123456     | 121K   | 121K
1234567    | 1.18M  | 1.2M
12345678   | 11.8M  | 12M
123456789  | 118M   | 118M
1234567890 | 1.15G  | 1.2G
999        | 999    | 999
1000       | 1000   | 1000
1001       | 1001   | 1001
1023       | 1023   | 1023
1024       | 1.000K | 1.0K
1025       | 1.00K  | 1.1K
999999     | 977K   | 977K
1000000    | 977K   | 977K
1000001    | 977K   | 977K
1023999    | 1000K  | 1000K
1024000    | 1000K  | 1000K
1024001    | 1000K  | 1001K
1048575    | 1024K  | 1.0M
1048576    | 1.000M | 1.0M
1048577    | 1.00M  | 1.1M

This was produced with the following invocation:

```
for N in 1 12 123 1234 12345 123456 1234567 12345678 123456789 1234567890 999 1000 1001 1023 1024 1025 999999 1000000 1000001 1023999 1024000 1024001 1048575 1048576 1048577; do
  head -c $N /dev/urandom > r$N
done
./zstd -i1 -b1 -S r1 r12 r123 r1234 r12345 r123456 r1234567 r12345678 r123456789 r1234567890 r999 r1000 r1001 r1023 r1024 r1025 r999999 r1000000 r1000001 r1023999 r1024000 r1024001 r1048575 r1048576 r1048577
```
@felixhandte
Copy link
Contributor Author

The new changes include the --list command:

$ ./zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     977 KiB       977 KiB  1.000  XXH64  r1000000.zst
     1      0     977 KiB       977 KiB  1.000  XXH64  r1000001.zst
     1      0    1014   B      1000   B  0.986  XXH64  r1000.zst
     1      0    1015   B      1001   B  0.986  XXH64  r1001.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1023999.zst
     1      0    1.01 KiB      1023   B  0.986  XXH64  r1023.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1024000.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1024001.zst
     1      0    1.01 KiB     1.000 KiB  0.987  XXH64  r1024.zst
     1      0    1.01 KiB      1.00 KiB  0.987  XXH64  r1025.zst
     1      0    1.00 MiB      1024 KiB  1.000  XXH64  r1048575.zst
     1      0    1.00 MiB     1.000 MiB  1.000  XXH64  r1048576.zst
     1      0    1.00 MiB      1.00 MiB  1.000  XXH64  r1048577.zst
     1      0    1.15 GiB      1.15 GiB  1.000  XXH64  r1234567890.zst
     1      0     118 MiB       118 MiB  1.000  XXH64  r123456789.zst
     1      0    11.8 MiB      11.8 MiB  1.000  XXH64  r12345678.zst
     1      0    1.18 MiB      1.18 MiB  1.000  XXH64  r1234567.zst
     1      0     121 KiB       121 KiB  1.000  XXH64  r123456.zst
     1      0    12.1 KiB      12.1 KiB  0.999  XXH64  r12345.zst
     1      0    1.22 KiB      1.21 KiB  0.989  XXH64  r1234.zst
     1      0     136   B       123   B  0.904  XXH64  r123.zst
     1      0      25   B        12   B  0.480  XXH64  r12.zst
     1      0      14   B         1   B  0.071  XXH64  r1.zst
     1      0     977 KiB       977 KiB  1.000  XXH64  r999999.zst
     1      0    1013   B       999   B  0.986  XXH64  r999.zst
----------------------------------------------------------------- 
    28      0    2.56 GiB      2.56 GiB  1.000  XXH64  28 files

As well as in-progress compression in various ways:

$ ./zstd -f -9 r*
Compress: 35/50 files. Current: r123456789 Read:   108 MiB /   118 MiB ==> 100%
$ ./zstd -f -v -9 r*
*** zstd command line interface 64-bits v1.5.0, by Yann Collet ***
r1                   :1400.00%   (     1   B =>     14   B, r1.zst)            .00% 
r1000                :101.40%   (  1000   B =>   1014   B, r1000.zst)          
r1000000             :100.00%   (   977 KiB =>    977 KiB, r1000000.zst)       
r1000001             :100.00%   (   977 KiB =>    977 KiB, r1000001.zst)       
...
r12345678            :100.00%   (  11.8 MiB =>   11.8 MiB, r12345678.zst)      0.00% 
(L9) Buffered :  10.5 MiB - Consumed :  21.5 MiB - Compressed :  21.5 MiB => 100.00%
$ ./zstd -f -vv -9 r*
*** zstd command line interface 64-bits v1.5.0, by Yann Collet ***
r1                   :1400.00%   (     1   B =>     14   B, r1.zst)            00.00% 
r1                   : Completed in 0.00 sec  (cpu load : 96%)
r1000                :101.40%   (  1000   B =>   1014   B, r1000.zst)          
r1000                : Completed in 0.00 sec  (cpu load : 98%)
r1000000             :100.00%   (1000000   B => 1000037   B, r1000000.zst)     
r1000000             : Completed in 0.02 sec  (cpu load : 103%)
...
r123456789           :100.00%   (123456789   B => 123459629   B, r123456789.zst)  => 100.00% 
r123456789           : Completed in 1.21 sec  (cpu load : 118%)
(L9) Buffered :10616832   B - Consumed :1134559232   B - Compressed :1134585210   B => 100.00%

@Cyan4973
Copy link
Contributor

Thanks @felixhandte ! Looks like a great improvement !

@felixhandte felixhandte merged commit 67a2596 into facebook:dev Jun 10, 2021
@felixhandte felixhandte deleted the human_size_output branch June 10, 2021 20:55
@felixhandte felixhandte linked an issue Jun 10, 2021 that may be closed by this pull request
@scottchiefbaker
Copy link
Contributor

Let me start with: Thanks for working on this and getting merged so quickly.

Small nitpick though:

:./zstd -b1 -e10 -f /tmp/foo.bin
 1#foo.bin           : 268 MiB -> 6.98 MiB (38.36), 2002.5 MB/s, 5363.6 MB/s 
 2#foo.bin           : 268 MiB -> 6.82 MiB (39.25), 1916.1 MB/s, 5268.9 MB/s 
 3#foo.bin           : 268 MiB -> 7.87 MiB (34.02), 1619.2 MB/s, 4759.0 MB/s 
 4#foo.bin           : 268 MiB -> 7.87 MiB (34.00), 1571.1 MB/s, 4753.0 MB/s 

We have conflicting data types? MiB for the before/after but MB/s for the rate? I would expect those to be the same.

@terrelln
Copy link
Contributor

terrelln commented Jun 10, 2021

We have conflicting data types? MiB for the before/after but MB/s for the rate? I would expect those to be the same.

The speeds are also in MiB/s, they just use MB/s to mean that. We should switch them to print MiB/s as well.

@Cyan4973
Copy link
Contributor

Actually, speeds are indeed in MB/s.
We have to keep it that way, in order to generate results comparable between versions.

@scottchiefbaker
Copy link
Contributor

If that is indeed the case, then why don't we change the new output to also be in MB instead of MiB?

Since we're introducing the new human string format, no one is expecting it to be any format, let's make that MB. That way both units are the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compression results are hard to read if numbers are very large
7 participants