unimplemented and slow extraction of solid archives containing many entries #21
I'm looking at it now. It seems the library extracts a range of entries in order to reach the right one (perhaps because it has to), and the overall strategy may need to be reworked to fix the slowdown, maybe catering more to the library and sacrificing the ability to skip certain files. I'll tinker more and post again.
I managed to hack the code to make this work. Before: 06:02:11.8560134. I may make a PR for this; it's rather hacky, but it's worth a shot.
Here it is if you want to see: https://github.com/fartwhif/SevenZipSharp/tree/speed — I'm going to test further before making a PR.
I'll have a closer look a little later, but off the cuff I'm hesitant to change the public method signatures (though I haven't checked enough to see whether that can be avoided). I do like the speed increase and will probably test it myself. Do you have an example archive to test against, or at least the specs of the one you used (number of files and size)?
The test was done with three threads, each extracting a different 7z file (not volume archives). Each 7z contains thousands of files; all three together total 770 MB and 6984 files. One thing is for sure: the massive memory usage may need to be looked at (it shoots up to 2 GB for a moment; I'm not sure whether that's my test apparatus' fault or not). The GC seems to do its job well with default settings, and it should be up to the caller, not SevenZipSharp, to detach MemoryStreams so they can be cleaned up. Perhaps that's all it is: the momentary lifetime of the MemoryStreams within my test apparatus makes memory usage skyrocket because things are moving so fast, plus there's a debugger attached, so the GC will be extra lax. All of this high memory usage flashes by; if a fake stream were used, I don't think it would happen. I highly doubt tinkering with the GC within SevenZipSharp could change this aspect of the test.
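On the detach-it-yourself point: since `ExtractFile(int, Stream)` writes into whatever stream the caller supplies, the caller controls that stream's lifetime. A minimal sketch of keeping peak memory flat (the helper name here is mine, not SevenZipSharp API):

```csharp
using System.IO;
using SevenZip;

// Sketch: the caller, not the library, owns the MemoryStream's lifetime.
// Copy the bytes out and dispose the stream immediately, so its internal
// buffer becomes collectible as soon as the entry has been consumed.
static byte[] ExtractToBytes(SevenZipExtractor extractor, int index)
{
    using (var ms = new MemoryStream())
    {
        extractor.ExtractFile(index, ms);
        return ms.ToArray(); // copy out; the stream's buffer dies with it
    }
}
```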
The problem fixed here would be exacerbated if you threw together a single solid 7z of 770 MB with 6984 files; I think the test I did understated the magnitude of the improvement, so a better test is also in order. Another abstraction could be used to hide those stream classes. That's fine, just add another overload, please. I wouldn't expect a full rewrite so fast! An abstraction could be used to keep the same signature as this:
Before:
After, with the hack, it's doing something like this:
This is the caller I'm using. It has to check whether the MemoryStream is null; it's important to know that (maybe it's only for folders, I'm not sure). Perhaps the previous signature could be implemented around this.
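Something along these lines, hedged since the fork's exact delegate shape may differ; the null check is the important part:

```csharp
using System.IO;

// Hypothetical per-entry callback against the hacked API described above.
// The stream can be null (possibly for folder entries), so check first.
static void OnEntryExtracted(string fileName, MemoryStream stream)
{
    if (stream == null)
    {
        return; // no data - likely a folder entry
    }
    using (stream)
    {
        byte[] data = stream.ToArray();
        // ... consume data ...
    }
}
```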
I don't know the extent of the slowdown; perhaps it only occurs when you want all entries extracted from a solid 7z that contains many files totaling a few hundred MB compressed. Perhaps the same kind of problem affects other scenarios as well. I would definitely look into that "null stream" in the hockey-stick / triangle pattern.
Do you want me to make a PR? I'll leave my fork up so you can reference it.
Go ahead, I haven't really had time to look into this yet and can't promise when I will.
Using [sharpcompress] to decompress is a temporary solution for me.
I am not 100% sure this is the same issue, but I have a 7zip file that's about 20 MB in size and contains about half a million text files (that differ from each other in very small degrees). Unpacked, we're looking at tens of gigabytes, hence working with the raw files is impractical. My software only needs to iterate through all the files in the archive in straight order, but with the way things are now (I think), it gets progressively slower and slower. What's also specifically important to me is to read the files into memory and not write them to disk, as that would be ridiculously slow for obvious reasons. The perfect API would be one of two, in my opinion: either an "ExtractNext()" function that remembers the state of the last extraction and can continue there, and/or some kind of asynchronous(?) call that is passed a callback function that gets called every time a file is extracted, so that it could be handled with a lambda like a loop. For example:

```csharp
while (ExtractedFile file = extractor.ExtractNext())
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
}
```

or

```csharp
extractor.SequentialExtractFiles((ExtractedFile file) =>
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
});
```

Edit: In the latter case, it would be ideal if the callbacks were called in the correct order, so that the callback for file 2 only gets called after the callback for file 1, and so on. It could (but wouldn't have to be) extracting the next files in the background on multiple threads. Maybe this could be made an option where you specify whether you care about the callbacks being ordered, but I don't see any necessity for an option to disable it. I realize it's probably not a terribly common problem to have, but it is a big hurt.

Edit 2: I can confirm sharpcompress does not have this issue. What took me 6 hours in optimized Release code with this library, sharpcompress did within a minute in Debug mode.
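To make the proposed surface concrete, it could be as small as the following sketch (entirely hypothetical — neither of these types exists in SevenZipSharp today):

```csharp
using System;

// Hypothetical API sketch for the sequential-extraction proposal above.
public sealed class ExtractedFile
{
    private readonly byte[] _data;

    public ExtractedFile(string fileName, byte[] data)
    {
        FileName = fileName;
        _data = data;
    }

    public string FileName { get; }

    // Returns the entry's decompressed bytes.
    public byte[] GetBytes() => _data;
}

public interface ISequentialExtractor : IDisposable
{
    // Returns the next entry in archive order, or null when exhausted.
    // The extractor keeps its decoder position between calls, so N calls
    // make one pass over the solid stream instead of N partial passes.
    ExtractedFile ExtractNext();

    // Callback variant: invoked once per entry, in archive order.
    void SequentialExtractFiles(Action<ExtractedFile> onFile);
}
```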
I commented out the first branch so that it always uses the non-solid implementation, but:

I've been trying to extract all files from some solid 7z files. It's fast at first, but the further it gets, the slower it gets, slowing almost to a stop; it takes 6 hours instead of a minute or so because of this. It's doing a triangle.

`ExtractFile(Index, Stream);` exhibits the same slowdown.
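For reference, the "triangle" is exactly what solid compression predicts: entries share one compressed stream, so reaching entry i means decoding entries 0..i-1 first, and extracting all N entries one at a time costs roughly 1 + 2 + ... + N = N(N+1)/2 entry-decodes instead of N. The pattern that triggers it looks like this (a sketch; whether a given build avoids the re-decoding is what this issue is about):

```csharp
using System.IO;
using SevenZip;

using (var extractor = new SevenZipExtractor("solid.7z"))
{
    int n = extractor.ArchiveFileData.Count;
    for (int i = 0; i < n; i++)
    {
        using (var ms = new MemoryStream())
        {
            // For a solid archive, each call can force the decoder to
            // re-decode entries 0..i-1 to reach entry i, so the loop does
            // roughly n*(n+1)/2 entry-decodes in total: the "triangle".
            extractor.ExtractFile(i, ms);
        }
    }
}
```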