update `GatherResult` to no longer store MinHashes #2950

ctb · 2024-01-27T14:20:33Z

over in https://github.com/ctb/2024-calc-full-gather/ I have implemented a simple script that takes fastgather output (from https://github.com/sourmash-bio/sourmash_plugin_branchwater/) and turns it into full gather output without redoing the searches - it literally just trusts the rank and match information from fastgather completely, and calculates all the stats.

this was easier than I expected because of the very nice GatherResult refactoring that @bluegenes did a while back in #1955!

however it also revealed that #1955 probably added significantly to the memory footprint of gather, because the GatherResult dataclasses keep sketches in memory and they are retained throughout the full gather process.

I figured this out when I noticed that my calc-full-gather script was running out of memory in the same way that gather was running out of memory, and in ctb/2024-calc-full-gather@a09215e I fixed it by discarding the GatherResult objects after each result. It's now nice and low memory (if not exactly fast ;) - see #2943.
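A minimal sketch of the "discard after each result" pattern described above. Everything here is a toy stand-in: `ToyResult`, `iter_results`, and `write_streaming` are hypothetical names for illustration, not sourmash's `GatherResult` or the actual calc-full-gather code. The idea is to reduce each result to its CSV row right away and never hold on to the object carrying the sketch.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class ToyResult:
    """Hypothetical stand-in for a GatherResult-like object (not sourmash's class)."""
    name: str
    hashes: frozenset  # stands in for the large sketch held by each result

    def to_row(self):
        # Reduce the result to the small scalar values that go into the CSV.
        return {"name": self.name, "n_hashes": len(self.hashes)}


def iter_results(n_results, sketch_size):
    """Yield one toy result at a time, as a gather-style loop would."""
    for i in range(n_results):
        yield ToyResult(f"match{i}", frozenset(range(i, i + sketch_size)))


def write_streaming(results, fp):
    """Write each row as soon as it is available; keep no result objects."""
    writer = csv.DictWriter(fp, fieldnames=["name", "n_hashes"])
    writer.writeheader()
    n = 0
    for result in results:
        writer.writerow(result.to_row())
        n += 1
        # `result` is rebound on the next iteration, so the previous object
        # (and the sketch it holds) can be freed at that point.
    return n


out = io.StringIO()
n_written = write_streaming(iter_results(5, 1000), out)
```

Appending each `ToyResult` to a list instead would keep all five sketches alive until the end, which is the retention pattern this issue is about.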

I am also wondering if perhaps PrefetchResult has the same problem in prefetch?

We should fix the gather code in sourmash to be lower memory.

We probably need to do some kind of regression testing that tracks memory usage and the like, too.
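One possible shape for such a regression check, sketched with the standard library's `tracemalloc`. The functions `make_sketch`, `accumulate`, `stream`, and `peak_bytes` are hypothetical and only illustrate the comparison. Note the assumption baked in: `tracemalloc` only sees allocations made through Python's allocator, so memory held by native extension code would need a process-level measure such as maximum resident set size instead.

```python
import tracemalloc


def make_sketch(i, size=20_000):
    """Hypothetical stand-in for loading a large sketch."""
    return list(range(i, i + size))


def accumulate(n):
    # Keeps every sketch alive until the end, like a retained results list.
    kept = [make_sketch(i) for i in range(n)]
    return sum(len(s) for s in kept)


def stream(n):
    # Keeps only a running total; each sketch is released after use.
    total = 0
    for i in range(n):
        total += len(make_sketch(i))
    return total


def peak_bytes(func, *args):
    """Return (result, peak traced memory in bytes) for one call."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


acc_total, acc_peak = peak_bytes(accumulate, 20)
str_total, str_peak = peak_bytes(stream, 20)
```

A test could then assert that the streaming path's peak stays below the accumulating path's peak, or below a fixed budget, so that a change which starts retaining sketches again fails loudly.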

viz sourmash-bio/sourmash_plugin_branchwater#187, #2943.

ctb · 2024-01-28T23:07:29Z

note connection to suggestions at bottom of #416

ctb · 2024-01-30T19:17:06Z

I have questions - are the dataclasses caching results, or recomputing them multiple times?? -- and the people demand answers!

ctb · 2024-01-30T21:50:32Z

OK, #2962 tackles this for just sourmash gather and multigather

ctb · 2024-01-30T21:53:16Z

PrefetchResult is immediately released, so sourmash prefetch doesn't suffer from this problem.

looks like SearchResult may run afoul of this, however. But it's used in relatively minimal ways so far.

…#2962)

This is kind of a patch-fix for #2950 for `sourmash gather` specifically.

This PR changes `sourmash gather` and `sourmash multigather` so that they no longer store any `GatherResult` objects, thus decreasing memory usage substantially.

The solution is hacky at several levels, including storing a CSV file in memory rather than writing it progressively. But I think it's an important fix to get in, since `gather` is one of our main use cases and it's causing people some problems (including me) :(.

The PR also changes `--save-matches` so that it writes out sketches as they are encountered. This breaks semantic versioning a little bit because the target file for `--save-matches` is opened before any matches are found, and thus may be empty and may also overwrite files unnecessarily.

Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.

## benchmarking

with latest @ e2c199f: 645 MB

```
Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
User time (seconds): 48.51
System time (seconds): 1.15
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:49.91
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 644900
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 156
Minor (reclaiming a frame) page faults: 254494
Voluntary context switches: 2412
Involuntary context switches: 2749
Swaps: 0
File system inputs: 31488
File system outputs: 64
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

with this branch: 215 MB

```
Command being timed: "sourmash gather /home/ctbrown/transfer/SRR606249.trim.k31.sig.gz /home/ctbrown/transfer/podar-ref.zip -o xxx.csv"
User time (seconds): 43.38
System time (seconds): 0.89
Percent of CPU this job got: 97%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:45.58
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 215560
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 773
Minor (reclaiming a frame) page faults: 148722
Voluntary context switches: 3884
Involuntary context switches: 6174
Swaps: 0
File system inputs: 151648
File system outputs: 160
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```

ctb · 2024-01-31T19:04:08Z

#2962 addresses the memory usage, but not the underlying problem. From the PR:

> Ultimately, a better fix is needed - probably one that changes up the dataclasses so that they don't store MinHashes - but such a fix is beyond me at the moment.
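For illustration only, one way a result dataclass could avoid storing MinHashes is to compute its statistics at construction time and keep just the scalars. This is a sketch under assumptions, not the design chosen in sourmash: `SlimResult`, `from_hash_sets`, and the field names are hypothetical, and plain Python sets stand in for sketches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlimResult:
    """Hypothetical result record holding only scalars, no sketches."""
    name: str
    intersect_hashes: int
    frac_of_match: float  # fraction of the match's hashes found in the query

    @classmethod
    def from_hash_sets(cls, name, query_hashes, match_hashes):
        # Compute every statistic up front, while both hash sets are available.
        common = len(query_hashes & match_hashes)
        frac = common / len(match_hashes) if match_hashes else 0.0
        # Only the derived numbers are stored; the sets are not referenced.
        return cls(name=name, intersect_hashes=common, frac_of_match=frac)


query = set(range(0, 100))
match = set(range(50, 150))
res = SlimResult.from_hash_sets("toy-match", query, match)
```

With this shape, keeping a list of results costs a few numbers per match, and the caller is free to drop the hash sets as soon as the record is built.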

ctb mentioned this issue Jan 28, 2024

should sourmash gather insist on uniform scaling? #2951

Open

ctb mentioned this issue Jan 28, 2024

EXP: switch to using calc-full-gather.py dib-lab/sourmash-slainte#18

Open

ctb mentioned this issue Jan 30, 2024

MRG: fix gather memory usage issue by not accumulating GatherResult #2962

Merged

This was referenced Feb 4, 2024

consider adding sourmash gather execution directly to fastgather as postprocessing sourmash-bio/sourmash_plugin_branchwater#107

Closed

MRG: add more columns, including ANI, by using PrefetchResult sourmash-bio/sourmash_plugin_containment_search#1

Merged

ctb mentioned this issue Feb 14, 2024

MRG: Calculate all gather stats in rust; use for rocksdb gather #2943

Merged

ctb changed the title ~~current gather code is very memory intensive; switch back to streaming?~~ update GatherResult to no longer store MinHashes Mar 5, 2024

ctb mentioned this issue Mar 15, 2024

Best practice for ONT metagenomics #3070

Open
