Filesize mismatch when decompressing multi-stream with sizes greater than 2GB (2^31) #61

Closed
alamb opened this issue Jul 14, 2020 · 6 comments · May be fixed by transparencies/zip#3

Comments

@alamb
Contributor

alamb commented Jul 14, 2020

Here is a self-contained reproducer of the problem: bzip_bug.zip

To reproduce:

unzip bzip_bug.zip
cd bzip_bug
cargo run --release --bin bzip_bug

You will see the following output:

$ cargo run --release --bin bzip_bug
   Compiling pkg-config v0.3.18
   Compiling libc v0.2.72
   Compiling cc v1.0.58
   Compiling bzip2-sys v0.1.9+1.0.8
   Compiling bzip2 v0.4.1
   Compiling bzip_bug v0.1.0 (/private/tmp/foo/bzip_bug)
    Finished release [optimized] target(s) in 9.08s
     Running `target/release/bzip_bug`
Generating expected results
Decompressing stream made with bzip2
Decompressing stream made with pbzip2
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `405900000`,
 right: `3000000000`: decompressed length mismatch', src/main.rs:37:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The expected output is that the program completes successfully.

The test has two files:

  1. raw.dat.bz2 is a 3GB generated file produced using the bzip2 tool on a mac
  2. raw.dat.pbz2 is the same 3GB generated file produced using the pbzip2 tool on a mac

The data was generated using the generate.rs tool in the package, via the following commands:

$ cargo build --release --all
$ ./target/release/generate | nice bzip2 > raw.dat.bz2&
$ ./target/release/generate | nice pbzip2 > raw.dat.pbz2&
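The generate.rs source is only available inside the attached archive, so the following is a hypothetical stand-in rather than the actual tool. It assumes the data is simply the ten-byte pattern "0123456789" repeated, which matches what src/main.rs builds as its expected buffer. The function name `generate` and the command-line argument are illustrative assumptions.

```rust
use std::io::{self, BufWriter, Write};

/// Writes the ten-byte pattern "0123456789" to `out`, `repeats` times.
/// Hypothetical stand-in for the generate.rs shipped in bzip_bug.zip.
fn generate<W: Write>(out: &mut W, repeats: u64) -> io::Result<()> {
    const PATTERN: &[u8] = b"0123456789";
    for _ in 0..repeats {
        out.write_all(PATTERN)?;
    }
    out.flush()
}

fn main() -> io::Result<()> {
    // The repeat count is read from the first argument (an assumption of this
    // sketch). 300000000 repeats would give the 3_000_000_000 bytes from the
    // issue; the default is kept tiny so a bare run stays cheap.
    let repeats = std::env::args()
        .nth(1)
        .and_then(|s| s.parse::<u64>().ok())
        .unwrap_or(3);
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    generate(&mut out, repeats)
}
```

Piping the output of such a program through bzip2 and pbzip2 would produce one single-stream and one multi-stream file from identical input.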

You can check the output byte counts using bzcat:

$ bzcat raw.dat.bz2 | wc -c
  3000000000
$ bzcat raw.dat.pbz2 | wc -c
 3000000000

The issue seems to affect files whose decompressed size is larger than 2^31 bytes (which smells like a 32-bit integer overflow somewhere).
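As a quick sanity check on the overflow hypothesis, the arithmetic below compares the observed length with what a plain 31-bit or 32-bit wraparound of 3,000,000,000 would give. It also tests whether the observed length is a whole number of 900,000-byte chunks. That chunk size is an assumption: it is what pbzip2 is believed to use by default for each independently compressed stream, and it has not been verified against the attached files. The helper names are illustrative.

```rust
/// Value a length would show after wrapping around in a `bits`-wide counter.
fn wrapped(len: u64, bits: u32) -> u64 {
    len % (1u64 << bits)
}

/// Number of whole chunks if `len` is an exact multiple of `chunk`.
fn whole_chunks(len: u64, chunk: u64) -> Option<u64> {
    if chunk != 0 && len % chunk == 0 {
        Some(len / chunk)
    } else {
        None
    }
}

fn main() {
    let expected: u64 = 3_000_000_000;
    let observed: u64 = 405_900_000;
    // Assumed pbzip2 default chunk size (not confirmed by the issue).
    let assumed_chunk: u64 = 900_000;

    println!("wrap at 2^31: {}", wrapped(expected, 31));
    println!("wrap at 2^32: {}", wrapped(expected, 32));
    println!("whole chunks in observed: {:?}", whole_chunks(observed, assumed_chunk));
}
```

Neither wraparound value equals 405900000, while that length is an exact multiple of the assumed chunk size. Under these assumptions, the numbers look more like a read that stopped at a stream boundary than a counter overflow.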

@alexcrichton
Owner

Thanks for the report! Is there some code to peek at as well? That'd help narrow down where the issue is in this crate.

@alamb
Contributor Author

alamb commented Jul 14, 2020

Hi @alexcrichton -- yes sorry the code is in src/main.rs of the attached bzip_bug.zip. Copy / pasting below as well:

use std::fs::File;
use bzip2::read::{BzDecoder, MultiBzDecoder};
use std::io::{Read, Write};

fn main() -> Result<(), std::io::Error> {

    // Decompress both files and compare the results with the expected output

    println!("Generating expected results");
    let mut expected = Vec::with_capacity(3_000_000_100);
    let data = "0123456789";
    for _ in 0..300_000_000 {
        expected.write_all(data.as_bytes())?;
    }
    let expected_len = 3_000_000_000;
    assert_eq!(expected.len(), expected_len);

    // This passes
    println!("Decompressing stream made with bzip2");
    let mut decompressor = BzDecoder::new(File::open("raw.dat.bz2").expect("raw.dat.bz2 not found"));
    let mut contents = Vec::with_capacity(3_000_000_100);
    let num_read = decompressor.read_to_end(&mut contents).expect("error decompressing bz2 data");
    assert_eq!(num_read, expected_len, "decompressed length mismatch");
    assert_eq!(contents, expected, "data mismatch");

    // This fails with:
    // ...
    // Decompressing stream made with pbzip2
    // thread 'main' panicked at 'assertion failed: `(left == right)`
    //   left: `405900000`,
    //  right: `3000000000`: decompressed length mismatch', src/main.rs:30:5
    // note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    println!("Decompressing stream made with pbzip2");
    let mut decompressor = MultiBzDecoder::new(File::open("raw.dat.pbz2").expect("raw.dat.pbz2 not found"));
    let mut contents = Vec::with_capacity(3_000_000_100);
    let num_read = decompressor.read_to_end(&mut contents).expect("error decompressing pbz2 data");
    assert_eq!(num_read, expected_len, "decompressed length mismatch");
    assert_eq!(contents, expected, "data mismatch");

    println!("Done");
    Ok(())
}
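The reproducer above holds several 3 GB buffers in memory at once. A lighter variant, sketched here with only the standard library, streams the decoder output through a checking sink instead. `PatternChecker` and `check_stream` are hypothetical names introduced for this sketch. In the reproducer, the reader passed in would be the `BzDecoder` or `MultiBzDecoder`; here a byte slice stands in for it.

```rust
use std::io::{self, Read, Write};

const PATTERN: &[u8] = b"0123456789";

/// A `Write` sink that checks every byte against the repeating pattern
/// and counts how many bytes matched, without storing them.
struct PatternChecker {
    seen: u64,
}

impl Write for PatternChecker {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        for &byte in buf {
            let want = PATTERN[(self.seen % PATTERN.len() as u64) as usize];
            if byte != want {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    format!("data mismatch at byte {}", self.seen),
                ));
            }
            self.seen += 1;
        }
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Drains `reader` and returns the number of bytes that matched the pattern.
fn check_stream<R: Read>(mut reader: R) -> io::Result<u64> {
    let mut checker = PatternChecker { seen: 0 };
    io::copy(&mut reader, &mut checker)?;
    Ok(checker.seen)
}

fn main() -> io::Result<()> {
    // A small in-memory sample stands in for the decompressor here.
    let sample = b"01234567890123456789";
    let matched = check_stream(&sample[..])?;
    println!("{} bytes matched the pattern", matched);
    Ok(())
}
```

Comparing the returned count against 3_000_000_000 would give the same length check as the original assertions while using a fixed amount of memory.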

@alexcrichton
Owner

Oh gah sorry I missed the link from before, my bad!

@alexcrichton
Owner

OK, sorry, I don't have a ton of time to look into this right now, but it may be a relatively easy bug to fix in MultiBzDecoder. It will likely take quite a while before I have time to get back to debugging this, though.

@alamb
Contributor Author

alamb commented Jul 16, 2020

No worries @alexcrichton -- if I get a chance I will look into it too. I figured the reproducer was the most important thing to prepare so I wanted to get that posted.

@afflux
Contributor

afflux commented Feb 16, 2021

I got hit by a similar bug today, and it was a bit tricky to analyze. In short, the line below signals EOF even for multistreams if the input buffer happens to line up so that it starts with the bzip2 end-of-stream marker (BZ2_bzDecompress then returns BZ_STREAM_END without total_in having advanced, leaving read = 0):

return Ok(read);

afflux added a commit to afflux/bzip2-rs that referenced this issue Feb 16, 2021
when the input buffer has an end-of-stream marker right at the
beginning, decompress() will return StreamEnd and total_in will not
advance. We cannot return Ok(read) as this would signal EOF. Instead, we
rely on the next loop iteration to really return EOF when the input
buffer did not fill again.
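The behaviour described in the comment and commit message above can be imitated with a toy reader that uses only the standard library. This is not the bzip2 crate's code: `ToyMultiDecoder`, its `buggy` flag, and `read_all` are hypothetical names. Each inner Vec stands in for one bzip2 stream, and an exhausted stream plays the role of an end-of-stream marker sitting at the very start of the input buffer. The sketch relies on one documented fact: `Read::read_to_end` treats an `Ok(0)` from `read` as end of input.

```rust
use std::io::{self, Read};

/// Toy stand-in for a multi-stream decoder (not the real MultiBzDecoder).
struct ToyMultiDecoder {
    streams: Vec<Vec<u8>>,
    current: usize,
    offset: usize,
    /// When true, a stream boundary hit before producing any bytes
    /// is reported as Ok(0), mimicking the behaviour described above.
    buggy: bool,
}

impl Read for ToyMultiDecoder {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        loop {
            if self.current == self.streams.len() {
                return Ok(0); // genuine end of input
            }
            let rest = &self.streams[self.current][self.offset..];
            if rest.is_empty() {
                // End of this stream reached with nothing produced in this call.
                self.current += 1;
                self.offset = 0;
                if self.buggy {
                    return Ok(0); // callers interpret this as EOF
                }
                continue; // fixed variant: let the next iteration decide
            }
            let n = rest.len().min(buf.len());
            buf[..n].copy_from_slice(&rest[..n]);
            self.offset += n;
            return Ok(n);
        }
    }
}

/// Reads everything the toy decoder yields via `read_to_end`.
fn read_all(streams: Vec<Vec<u8>>, buggy: bool) -> Vec<u8> {
    let mut decoder = ToyMultiDecoder { streams, current: 0, offset: 0, buggy };
    let mut out = Vec::new();
    decoder.read_to_end(&mut out).expect("toy decoder never fails");
    out
}

fn main() {
    let streams = vec![b"abc".to_vec(), b"def".to_vec(), b"ghi".to_vec()];
    println!("buggy: {} bytes", read_all(streams.clone(), true).len());
    println!("fixed: {} bytes", read_all(streams, false).len());
}
```

In the buggy variant the output stops after the first stream, which is the same shape as the truncated length in the report. The fixed variant returns `Ok(0)` only once no streams remain.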
afflux added a commit to afflux/bzip2-rs that referenced this issue Feb 17, 2021