unimplemented and slow extraction of solid archives containing many entries #21
I'm looking at it now. It seems the library extracts a range of entries in order to reach the right one (perhaps because it has to), and the overall strategy may need to be reworked to fix the slowdown, maybe catering more to the library and sacrificing the ability to skip certain files. I'll tinker more and post again.
I managed to hack the code to make this work. Before: 06:02:11.8560134. I may make a PR for this; it's rather hacky, but it's worth a shot.
Here it is if you want to see: https://github.com/fartwhif/SevenZipSharp/tree/speed — I'm going to test further before making a PR.
I'll have a closer look a little later, but off the cuff I'm hesitant to change the public method signatures (though I haven't checked enough to see whether that can be avoided). I do like the speed increase and will probably test it myself. Do you have an example archive to test against, or at least the specs of the one you used (number of files and size)?
The test was done with three threads, each extracting a different 7z file (not volume archives). Each 7z contains thousands of files; all three together total 770 MB and 6984 files. One thing is for sure: the massive memory usage may need to be looked at (it shoots up to 2 GB for a moment; I'm not sure whether that's my test apparatus' fault or not). The GC seems to do its job well with default settings, and it should be up to the caller, not SevenZipSharp, to detach MemoryStreams so they can be cleaned up. Perhaps that's all it is: the momentary lifetime of the MemoryStreams within my test apparatus makes memory usage skyrocket because things are moving so fast, plus there's a debugger attached, so the GC will be extra lax. All of this high memory usage flashes by; if a fake stream were used, I don't think it would happen. I highly doubt tinkering with the GC within SevenZipSharp could change this aspect of the test.
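On the detach-it-yourself point: since `ExtractFile(int, Stream)` writes into whatever stream the caller supplies, the caller controls that stream's lifetime. A minimal sketch of keeping peak memory flat (the helper name here is mine, not SevenZipSharp API):

```csharp
using System.IO;
using SevenZip;

// Sketch: the caller, not the library, owns the MemoryStream's lifetime.
// Copy the bytes out and dispose the stream immediately, so its internal
// buffer becomes collectible as soon as the entry has been consumed.
static byte[] ExtractToBytes(SevenZipExtractor extractor, int index)
{
    using (var ms = new MemoryStream())
    {
        extractor.ExtractFile(index, ms);
        return ms.ToArray(); // copy out; the stream's buffer dies with it
    }
}
```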
The problem fixed here would be exacerbated if you threw together a single solid 7z of 770 MB with 6984 files; I think the test I did understated the magnitude of the improvement, so a better test is also in order. Another abstraction could be used to hide those stream classes. That's fine, just add another overload, please. I wouldn't expect a full rewrite so fast! An abstraction could be used to keep the same signature as this:
Before:
After, with the hack, it's doing something like this:
This is the caller I'm using. It has to check whether the MemoryStream is null; it's important to know that (maybe it's only for folders, I'm not sure). Perhaps the previous signature could be implemented around this.
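Something along these lines, hedged since the fork's exact delegate shape may differ; the null check is the important part:

```csharp
using System.IO;

// Hypothetical per-entry callback against the hacked API described above.
// The stream can be null (possibly for folder entries), so check first.
static void OnEntryExtracted(string fileName, MemoryStream stream)
{
    if (stream == null)
    {
        return; // no data - likely a folder entry
    }
    using (stream)
    {
        byte[] data = stream.ToArray();
        // ... consume data ...
    }
}
```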
I don't know the extent of the slowdown; perhaps it only occurs when you want all entries extracted from a solid 7z that contains many files totaling a few hundred MB compressed. Perhaps the same kind of problem affects other scenarios as well. I would definitely look into that "null stream" in the hockey-stick / triangle pattern.
Do you want me to make a PR? I'll leave my fork up so you can reference it.
Go ahead, I haven't really had time to look into this yet and can't promise when I will.
Using [sharpcompress] to decompress is a temporary solution for me.
I am not 100% sure this is the same issue, but I have a 7zip file that's about 20 MB in size and contains about half a million text files (that differ from each other in very small degrees). Unpacked, we're looking at tens of gigabytes, hence working with the raw files is impractical. My software only needs to iterate through all the files in the archive in straight order, but with the way things are now (I think), it gets progressively slower and slower. What's also specifically important to me is to read the files into memory and not write them to disk, as that would be ridiculously slow for obvious reasons. The perfect API would be one of two, in my opinion: either an "ExtractNext()" function that remembers the state of the last extraction and can continue there, and/or some kind of asynchronous(?) call that is passed a callback function that gets called every time a file is extracted, so that it could be handled with a lambda like a loop. For example:

```csharp
while (ExtractedFile file = extractor.ExtractNext())
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
}
```

or

```csharp
extractor.SequentialExtractFiles((ExtractedFile file) =>
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
});
```

Edit: In the latter case, it would be ideal if the callbacks were called in the correct order, so that the callback for file 2 only gets called after the callback for file 1, and so on. It could (but wouldn't have to be) extracting the next files in the background on multiple threads. Maybe this could be made an option where you specify whether you care about the callbacks being ordered, but I don't see any necessity for an option to disable it. I realize it's probably not a terribly common problem to have, but it is a big hurt.

Edit 2: I can confirm sharpcompress does not have this issue. What took me 6 hours in optimized Release code with this library, sharpcompress did within a minute in Debug mode.
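To make the proposed surface concrete, it could be as small as the following sketch (entirely hypothetical — neither of these types exists in SevenZipSharp today):

```csharp
using System;

// Hypothetical API sketch for the sequential-extraction proposal above.
public sealed class ExtractedFile
{
    private readonly byte[] _data;

    public ExtractedFile(string fileName, byte[] data)
    {
        FileName = fileName;
        _data = data;
    }

    public string FileName { get; }

    // Returns the entry's decompressed bytes.
    public byte[] GetBytes() => _data;
}

public interface ISequentialExtractor : IDisposable
{
    // Returns the next entry in archive order, or null when exhausted.
    // The extractor keeps its decoder position between calls, so N calls
    // make one pass over the solid stream instead of N partial passes.
    ExtractedFile ExtractNext();

    // Callback variant: invoked once per entry, in archive order.
    void SequentialExtractFiles(Action<ExtractedFile> onFile);
}
```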
I commented out the first branch so that it always uses the non-solid implementation, but:

I've been trying to extract all files from some solid 7z files. It's fast at first, but the further it gets, the slower it gets, slowing almost to a stop; it takes 6 hours instead of a minute or so because of this. It's doing a triangle.

`ExtractFile(Index, Stream);` exhibits the same slowdown.
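For reference, the "triangle" is exactly what solid compression predicts: entries share one compressed stream, so reaching entry i means decoding entries 0..i-1 first, and extracting all N entries one at a time costs roughly 1 + 2 + ... + N = N(N+1)/2 entry-decodes instead of N. The pattern that triggers it looks like this (a sketch; whether a given build avoids the re-decoding is what this issue is about):

```csharp
using System.IO;
using SevenZip;

using (var extractor = new SevenZipExtractor("solid.7z"))
{
    int n = extractor.ArchiveFileData.Count;
    for (int i = 0; i < n; i++)
    {
        using (var ms = new MemoryStream())
        {
            // For a solid archive, each call can force the decoder to
            // re-decode entries 0..i-1 to reach entry i, so the loop does
            // roughly n*(n+1)/2 entry-decodes in total: the "triangle".
            extractor.ExtractFile(i, ms);
        }
    }
}
```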