Format File Sizes Human-Readable in the CLI #2702

felixhandte · 2021-06-09T17:16:54Z

This PR extends @scottchiefbaker's #2696. It switches zstd's CLI output to printing human-readable representations of file sizes, rather than full-precision integers.

This table shows how this PR formats various sizes compared to ls -lh. There are some differences, but in general I prefer this formatting over ls's, since it provides a more consistent 3-4 digits of precision and rounds to nearest rather than always rounding up.

Size	`zstd`	`ls -lh`
1	1 B	1
12	12 B	12
123	123 B	123
1234	1.21 KiB	1.3K
12345	12.1 KiB	13K
123456	121 KiB	121K
1234567	1.18 MiB	1.2M
12345678	11.8 MiB	12M
123456789	118 MiB	118M
1234567890	1.15 GiB	1.2G
999	999 B	999
1000	1000 B	1000
1001	1001 B	1001
1023	1023 B	1023
1024	1.000 KiB	1.0K
1025	1.00 KiB	1.1K
999999	977 KiB	977K
1000000	977 KiB	977K
1000001	977 KiB	977K
1023999	1000 KiB	1000K
1024000	1000 KiB	1000K
1024001	1000 KiB	1001K
1048575	1024 KiB	1.0M
1048576	1.000 MiB	1.0M
1048577	1.00 MiB	1.1M
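
The scaling and rounding behaviour in the table can be sketched roughly as follows. This is an illustration only, not the code in this PR: the name `sketch_human_size`, the suffix list, and the precision thresholds are assumptions chosen so that the output matches the rows above (including `1.000 KiB` for an exact unit and `1024 KiB` for 1048575).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not zstd's actual function): writes a human-readable
 * rendering of `size` into `buf` and returns `buf`. */
static const char *sketch_human_size(unsigned long long size,
                                     char *buf, size_t buflen)
{
    static const char *const suffixes[] = {
        "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"
    };
    int unit = 0;
    int precision;
    double value;

    assert(buf != NULL && buflen > 0);

    /* Pick the largest binary prefix whose unit does not exceed `size`. */
    while (unit < 6 && (size >> (10 * (unit + 1))) != 0) {
        unit++;
    }
    value = (double)size / (double)(1ULL << (10 * unit));

    /* Aim for 3-4 significant digits; printf rounds to nearest. */
    if (unit == 0 || value >= 100.0) {
        precision = 0;   /* "123 B", "121 KiB", "1024 KiB" */
    } else if (value >= 10.0) {
        precision = 1;   /* "12.1 KiB" */
    } else if (value > 1.0) {
        precision = 2;   /* "1.21 KiB" */
    } else {
        precision = 3;   /* exactly one unit: "1.000 KiB" */
    }

    snprintf(buf, buflen, "%.*f %s", precision, value, suffixes[unit]);
    return buf;
}
```

Because the unit is chosen before rounding, 1048575 bytes stays in KiB and prints as `1024 KiB` instead of rolling over to MiB, which is the behaviour the table shows.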

Repro Instructions:

for N in 1 12 123 1234 12345 123456 1234567 12345678 123456789 1234567890 999 1000 1001 1023 1024 1025 999999 1000000 1000001 1023999 1024000 1024001 1048575 1048576 1048577; do
  head -c $N /dev/urandom > r$N
done
./zstd -i1 -b1 -S r1 r12 r123 r1234 r12345 r123456 r1234567 r12345678 r123456789 r1234567890 r999 r1000 r1001 r1023 r1024 r1025 r999999 r1000000 r1000001 r1023999 r1024000 r1024001 r1048575 r1048576 r1048577


Cyan4973 · 2021-06-09T17:26:30Z

I like the proposed new formatting.
Ready to validate once tests pass.

senhuang42 · 2021-06-09T17:34:14Z

I think the human readable format is generally better so this is great!

My only concern is that sometimes the full-precision ints are useful in development for spotting very small changes in compressed size when using -b#e# or even in general. I could also just compress the file and check the compressed result to spot these changes, but this gets a bit annoying when I'm trying out multiple levels. But then again, this human-readable thing was meant to improve the experience in the first place, so I think it's a valid enough concern.

Is there any easy way we can make it toggle back to raw bytes with -v, depending on g_displayLevel or something?

felixhandte · 2021-06-09T17:44:00Z

@senhuang42, good point. I was thinking about whether -v should expand these back to full precision but I couldn't think of a motivating use case. The one you provide is a good one that I missed.

The problem is that I'm not sure how I would best accomplish that. It wouldn't be enough to cancel the scaling and suffix. The value is currently stored as float, but even if it were a double, it would still lose the ability to carry full precision for files over 2^53 bytes in size. So to maintain full precision over the range of file sizes it would have to be transported as a uint64_t. But then it's unclear how to dispatch it in the format string...

You'd have to go back to preparing the value in a separate buffer. Which we could do, I guess. @Cyan4973, do you think that's worth doing?
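
To make the precision concern concrete, here is a small sketch that checks whether a byte count survives a round trip through `float` or `double`. The function names are hypothetical; the limits they expose (2^24 for `float`, 2^53 for `double`) assume IEEE 754 single and double precision, which is what mainstream platforms use.

```c
#include <stdint.h>

/* Returns 1 if `size` converts to double and back unchanged, 0 otherwise. */
static int sketch_fits_in_double(uint64_t size)
{
    volatile double d = (double)size;

    /* Values that round up to 2^64 cannot be converted back safely. */
    if (d >= 18446744073709551616.0) {
        return 0;
    }
    return (uint64_t)d == size;
}

/* Same check for float, whose significand holds only 24 bits. */
static int sketch_fits_in_float(uint64_t size)
{
    volatile float f = (float)size;

    if (f >= 18446744073709551616.0f) {
        return 0;
    }
    return (uint64_t)f == size;
}
```

With these, `sketch_fits_in_double((1ULL << 53) + 1)` returns 0, while `sketch_fits_in_float` already fails for a size such as 1234567890, which is why carrying the value as `float` cannot support full-precision display.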

felixhandte · 2021-06-09T17:51:28Z

Actually I think it's not so hard to return to working with our own buffer. I'll play around with that.

Cyan4973 · 2021-06-09T17:52:00Z

Yes, I believe that full-precision of size for benchmarking / measurement purposes is a good use case.

Also:

> it would still lose the ability to carry full precision for files over 2^53 bytes in size

This represents 8 PB. It feels like an acceptable limitation.
Reducing maximum accuracy to KB level in this case also seems an acceptable mitigation.

felixhandte · 2021-06-09T20:21:45Z

Ok I've made a few changes (refer to commit messages). Let me know if you think these make sense. I chose to bump full precision display to require double-verbose because that way single-verbose still gets you human-readable display of each compression when processing multiple files.

Cyan4973 · 2021-06-09T21:24:57Z

> I chose to bump full precision display to require double-verbose because that way single-verbose still gets you human-readable display of each compression when processing multiple files.

I agree, that seems a good choice.
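
A minimal sketch of the decision discussed above, assuming the size is still passed to the format string as a `double`. Everything here is hypothetical: `sketch_full_precision`, the `verbosity` parameter (a count of `-v` flags), and the choice of MiB with two decimals as the coarser fallback above 2^53 bytes are illustrative assumptions, not the PR's actual code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: `verbosity` counts -v flags (0 = none, 1 = -v, 2 = -vv).
 * Returns 1 and fills `buf` when the size should be shown at full precision,
 * or 0 when the caller should use the scaled, suffixed form instead. */
static int sketch_full_precision(unsigned long long size, int verbosity,
                                 char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    if (verbosity < 2) {
        return 0;   /* default and -v keep the human-readable form */
    }
    if (size < (1ULL << 53)) {
        /* Every integer below 2^53 is exactly representable as a double. */
        snprintf(buf, buflen, "%.0f B", (double)size);
    } else {
        /* Beyond that, fall back to a coarser unit (assumed: MiB). */
        snprintf(buf, buflen, "%.2f MiB", (double)size / 1048576.0);
    }
    return 1;
}
```

Column padding is left out here to keep the dispatch itself visible.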

aqrit · 2021-06-10T00:04:20Z

Bikeshedding: megabyte (MB) or mebibyte (MiB)?

This produces the following formatting:

Size | `zstd` | `ls -lh`
---------- | ------ | --------
1 | 1 | 1
12 | 12 | 12
123 | 123 | 123
1234 | 1.21K | 1.3K
12345 | 12.1K | 13K
123456 | 121K | 121K
1234567 | 1.18M | 1.2M
12345678 | 11.8M | 12M
123456789 | 118M | 118M
1234567890 | 1.15G | 1.2G
999 | 999 | 999
1000 | 1000 | 1000
1001 | 1001 | 1001
1023 | 1023 | 1023
1024 | 1.000K | 1.0K
1025 | 1.00K | 1.1K
999999 | 977K | 977K
1000000 | 977K | 977K
1000001 | 977K | 977K
1023999 | 1000K | 1000K
1024000 | 1000K | 1000K
1024001 | 1000K | 1001K
1048575 | 1024K | 1.0M
1048576 | 1.000M | 1.0M
1048577 | 1.00M | 1.1M

This was produced with the following invocation:

```
for N in 1 12 123 1234 12345 123456 1234567 12345678 123456789 1234567890 999 1000 1001 1023 1024 1025 999999 1000000 1000001 1023999 1024000 1024001 1048575 1048576 1048577; do
  head -c $N /dev/urandom > r$N
done
./zstd -i1 -b1 -S r1 r12 r123 r1234 r12345 r123456 r1234567 r12345678 r123456789 r1234567890 r999 r1000 r1001 r1023 r1024 r1025 r999999 r1000000 r1000001 r1023999 r1024000 r1024001 r1048575 r1048576 r1048577
```

felixhandte · 2021-06-10T17:30:11Z

The new changes include the --list command:

$ ./zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     977 KiB       977 KiB  1.000  XXH64  r1000000.zst
     1      0     977 KiB       977 KiB  1.000  XXH64  r1000001.zst
     1      0    1014   B      1000   B  0.986  XXH64  r1000.zst
     1      0    1015   B      1001   B  0.986  XXH64  r1001.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1023999.zst
     1      0    1.01 KiB      1023   B  0.986  XXH64  r1023.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1024000.zst
     1      0    1000 KiB      1000 KiB  1.000  XXH64  r1024001.zst
     1      0    1.01 KiB     1.000 KiB  0.987  XXH64  r1024.zst
     1      0    1.01 KiB      1.00 KiB  0.987  XXH64  r1025.zst
     1      0    1.00 MiB      1024 KiB  1.000  XXH64  r1048575.zst
     1      0    1.00 MiB     1.000 MiB  1.000  XXH64  r1048576.zst
     1      0    1.00 MiB      1.00 MiB  1.000  XXH64  r1048577.zst
     1      0    1.15 GiB      1.15 GiB  1.000  XXH64  r1234567890.zst
     1      0     118 MiB       118 MiB  1.000  XXH64  r123456789.zst
     1      0    11.8 MiB      11.8 MiB  1.000  XXH64  r12345678.zst
     1      0    1.18 MiB      1.18 MiB  1.000  XXH64  r1234567.zst
     1      0     121 KiB       121 KiB  1.000  XXH64  r123456.zst
     1      0    12.1 KiB      12.1 KiB  0.999  XXH64  r12345.zst
     1      0    1.22 KiB      1.21 KiB  0.989  XXH64  r1234.zst
     1      0     136   B       123   B  0.904  XXH64  r123.zst
     1      0      25   B        12   B  0.480  XXH64  r12.zst
     1      0      14   B         1   B  0.071  XXH64  r1.zst
     1      0     977 KiB       977 KiB  1.000  XXH64  r999999.zst
     1      0    1013   B       999   B  0.986  XXH64  r999.zst
----------------------------------------------------------------- 
    28      0    2.56 GiB      2.56 GiB  1.000  XXH64  28 files

As well as in-progress compression in various ways:

$ ./zstd -f -9 r*
Compress: 35/50 files. Current: r123456789 Read:   108 MiB /   118 MiB ==> 100%

$ ./zstd -f -v -9 r*
*** zstd command line interface 64-bits v1.5.0, by Yann Collet ***
r1                   :1400.00%   (     1   B =>     14   B, r1.zst)            .00% 
r1000                :101.40%   (  1000   B =>   1014   B, r1000.zst)          
r1000000             :100.00%   (   977 KiB =>    977 KiB, r1000000.zst)       
r1000001             :100.00%   (   977 KiB =>    977 KiB, r1000001.zst)       
...
r12345678            :100.00%   (  11.8 MiB =>   11.8 MiB, r12345678.zst)      0.00% 
(L9) Buffered :  10.5 MiB - Consumed :  21.5 MiB - Compressed :  21.5 MiB => 100.00%

$ ./zstd -f -vv -9 r*
*** zstd command line interface 64-bits v1.5.0, by Yann Collet ***
r1                   :1400.00%   (     1   B =>     14   B, r1.zst)            00.00% 
r1                   : Completed in 0.00 sec  (cpu load : 96%)
r1000                :101.40%   (  1000   B =>   1014   B, r1000.zst)          
r1000                : Completed in 0.00 sec  (cpu load : 98%)
r1000000             :100.00%   (1000000   B => 1000037   B, r1000000.zst)     
r1000000             : Completed in 0.02 sec  (cpu load : 103%)
...
r123456789           :100.00%   (123456789   B => 123459629   B, r123456789.zst)  => 100.00% 
r123456789           : Completed in 1.21 sec  (cpu load : 118%)
(L9) Buffered :10616832   B - Consumed :1134559232   B - Compressed :1134585210   B => 100.00%
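
The outputs above keep values and units lined up across rows even though `B` is shorter than `KiB`. One way to get that effect, shown here purely as a sketch, is to right-align the number and the unit in separate fixed-width fields. The helper name and the field widths (6 and 4) are assumptions picked to match the `--list` columns shown above, not necessarily the format string the PR uses.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: renders one fixed-width size cell such as
 * "   977 KiB" or "  1014   B". Returns the number of characters written. */
static int sketch_size_cell(char *buf, size_t buflen,
                            double value, int precision, const char *unit)
{
    assert(buf != NULL && buflen > 0 && unit != NULL);

    /* %6.*f right-aligns the number; %4s right-aligns the unit, so a
     * one-letter unit is padded to the same width as " KiB". */
    return snprintf(buf, buflen, "%6.*f%4s", precision, value, unit);
}
```

For example, `sketch_size_cell(buf, sizeof buf, 1014.0, 0, "B")` and `sketch_size_cell(buf, sizeof buf, 1.01, 2, "KiB")` both produce 10-character cells whose units end in the same column.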

Cyan4973 · 2021-06-10T17:42:01Z

Thanks @felixhandte ! Looks like a great improvement !

scottchiefbaker · 2021-06-10T21:01:04Z

Let me start with: Thanks for working on this and getting it merged so quickly.

Small nitpick though:

:./zstd -b1 -e10 -f /tmp/foo.bin
 1#foo.bin           : 268 MiB -> 6.98 MiB (38.36), 2002.5 MB/s, 5363.6 MB/s 
 2#foo.bin           : 268 MiB -> 6.82 MiB (39.25), 1916.1 MB/s, 5268.9 MB/s 
 3#foo.bin           : 268 MiB -> 7.87 MiB (34.02), 1619.2 MB/s, 4759.0 MB/s 
 4#foo.bin           : 268 MiB -> 7.87 MiB (34.00), 1571.1 MB/s, 4753.0 MB/s

We have conflicting data types? MiB for the before/after but MB/s for the rate? I would expect those to be the same.

terrelln · 2021-06-10T22:21:20Z

> We have conflicting data types? MiB for the before/after but MB/s for the rate? I would expect those to be the same.

The speeds are also in MiB/s; they just use MB/s to mean that. We should switch them to print MiB/s as well.

Cyan4973 · 2021-06-10T22:35:14Z

Actually, speeds are indeed in MB/s.
We have to keep it that way, in order to generate results comparable between versions.

scottchiefbaker · 2021-06-10T22:40:29Z

If that is indeed the case, then why don't we change the new output to also be in MB instead of MiB?

Since we're introducing the new human string format, no one is expecting it to be any format, let's make that MB. That way both units are the same.
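
For a sense of how far apart the two conventions are, the sketch below converts between decimal megabytes (10^6 bytes) and binary mebibytes (2^20 bytes). The function names are hypothetical, and the example does not settle which unit the benchmark speeds actually use; it only shows the size of the gap, which is about 4.9% at this scale.

```c
/* 1 MiB = 1048576 bytes, 1 MB = 1000000 bytes. */

/* Hypothetical helper: converts a quantity in MiB to MB. */
static double sketch_mib_to_mb(double mib)
{
    return mib * 1048576.0 / 1000000.0;
}

/* Hypothetical helper: converts a quantity in MB to MiB. */
static double sketch_mb_to_mib(double mb)
{
    return mb * 1000000.0 / 1048576.0;
}
```

Under these definitions, the `268 MiB` input from the benchmark output above corresponds to roughly 281 MB, and a rate printed as `2002.5 MB/s` would be roughly 1909.7 MiB/s if it really is decimal.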

facebook-github-bot added the CLA Signed label Jun 9, 2021

Cyan4973 reviewed Jun 9, 2021

Cyan4973 approved these changes Jun 9, 2021

scottchiefbaker and others added 19 commits June 10, 2021 12:53

Make the CLI output the file sizes in human readable format

26fab1d

Put the human_size() function in util.c

b70175e

Convert names to CamelCase

b6b23df

Make the variable types match

eefdbcd

Move the variable declarations to the top

4e0d9f1

Use human_size() in the benchmark output also

894698d

Use human_size() on the "multiple files compressed" output also

77001f0

Convert tabs to spaces

35576e6

human_size() should use size_t

e5fc830

Use unsigned long instead to help with some tests

1ef6f3d

Update humanSize() to skip the big numbers (it requires 64 bit)

64385ef

Try unsigned long long

20b9b00

Try enabling the BIG strings now the unsigned long long is in effect

376a273

Some fixes to address things @felixhandte found

1eb8528

Attempt to fix a failing test with help from @aqrit

8e0a969

Fix Integer Constants; Fix Comparison

9b67219

In Verbose Mode, Preserve Full Precision Where Possible

464bfb0

Change Suffix (e.g., "G" -> " GB")

93bb368

felixhandte added 5 commits June 10, 2021 12:53

Fix Whitespace

7e00588

Apply to Other Print Statement as Well

bc46b6e

Require -vv to Enable Full Precision

9c340ce

Switch to Binary Size Prefixes (e.g., "MB" -> "MiB")

2af3687

Suggested by @aqrit, a little more verbose, but hopefully addresses a real ambiguity.

Convert Other Size Displays to Use Human-Readable Formatting

87e94e3

felixhandte force-pushed the human_size_output branch from dbaab7b to 87e94e3 Compare June 10, 2021 16:58

felixhandte added 2 commits June 10, 2021 13:14

Update Tests to Reflect New Formatting

94cf57b

Whitespace Fixes to Improve Cross-Line Alignment

8c00807

felixhandte merged commit 67a2596 into facebook:dev Jun 10, 2021

felixhandte deleted the human_size_output branch June 10, 2021 20:55

felixhandte linked an issue Jun 10, 2021 that may be closed by this pull request

Compression results are hard to read if numbers are very large #2694

Closed
