Reduce SVConcordance memory footprint #8623
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The SVConcordance tool is currently too inefficient in terms of memory usage, requiring several 100's of GB of heap space on ~100K samples. This PR aims to reduce memory usage in two ways:
GT
andCN
, which are necessary and sufficient for concordance computations.--do-not-sort
is introduced to skip output record sorting. A major source of heap usage is the output buffer in theClosestSVFinder
class, which ensures records are emitted in coordinate-sorted order. This buffer quickly fills, however, when there is at least one record being actively clustered that spans a large interval because the buffer cannot be flushed until a variant beyond the maximal clusterable coordinate of that large variant is encountered. This option will allow users to substantially reduce max heap usage on larger call sets (a single SVRecord can consume ~100MB with 100K samples).Includes an integration test to cover the
--do-not-sort
functionality.