MarkDuplicatesSpark improvements checkpoint #4656
Conversation
Codecov Report
@@             Coverage Diff              @@
##             master     #4656       +/-   ##
==============================================
+ Coverage     79.84%    80.146%    +0.305%
- Complexity    17330      17620       +290
==============================================
  Files          1074       1080         +6
  Lines         62907      64036      +1129
  Branches      10181      10471       +290
==============================================
+ Hits          50225      51322      +1097
- Misses         8701       8704         +3
- Partials      3981       4010        +29
//register to avoid writing the full name of this class over and over
kryo.register(PairedEnds.class, new FieldSerializer<>(kryo, PairedEnds.class));
kryo.register(PairedEnds.class, new PairedEnds.PairedEndsEmptyFragmentSerializer());
This seems wrong; it should be registering each of the subclasses, not the abstract class.
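For illustration, per-subclass registration might look like the sketch below. This assumes the concrete subclasses in play are Fragment, EmptyFragment, Pair, and Passthrough, and reuses the serializers already referenced elsewhere in this PR; names and serializer pairings are illustrative, not the final code.
// Register each concrete subclass: Kryo dispatches on the runtime class, so
// registering only the abstract PairedEnds base would not cover instances.
kryo.register(Fragment.class, new FieldSerializer<>(kryo, Fragment.class));
kryo.register(EmptyFragment.class, new PairedEnds.PairedEndsEmptyFragmentSerializer());
kryo.register(Pair.class, new Pair.Serializer());
kryo.register(Passthrough.class, new FieldSerializer<>(kryo, Passthrough.class));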
…he tool with smaller serializing
… what is going on
…y and supplementary reads
force-pushed from 606c6d5 to 8ad0671
@jamesemery some comments; not done, but if you're going to start making changes I'll get these in now
@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
this-isn't-kebab-case
also, it should be a constant since it's duplicated a bunch of times.
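A minimal sketch of both fixes together, assuming the constant lives on MarkDuplicatesSpark (later diffs in this thread reference MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES; the kebab-case value shown is illustrative):
// In MarkDuplicatesSpark: one kebab-case name, defined once and reused everywhere.
public static final String DO_NOT_MARK_UNMAPPED_MATES = "do-not-mark-unmapped-mates";

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES,
        doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;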
done
@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = "do_not_mark_unmapped_mates", doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
we might want to consider a DuplicateMarkingArgumentCollection so we don't have to duplicate this all over the place
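A rough sketch of what such a collection might look like; the class and field names are illustrative, and per the later comments it should also absorb duplicates_scoring_strategy:
public final class MarkDuplicatesSparkArgumentCollection implements Serializable {
    private static final long serialVersionUID = 1L;

    @Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, optional = true,
            doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
    public boolean dontMarkUnmappedMates = false;

    @Argument(fullName = "duplicates-scoring-strategy", optional = true,
            doc = "The scoring strategy for choosing the non-duplicate among candidates.")
    public MarkDuplicatesScoringStrategy duplicatesScoringStrategy = MarkDuplicatesScoringStrategy.SUM_OF_BASE_QUALITIES;
}
Each tool would then expose a single @ArgumentCollection field instead of repeating the @Argument in every class.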
@@ -34,6 +34,7 @@ public void test() throws Exception {
args.addInput(input);
args.addOutput(output);
args.addBooleanArgument(StandardArgumentDefinitions.DISABLE_SEQUENCE_DICT_VALIDATION_NAME, true);
args.add("--do_not_mark_unmapped_mates");
use the constant
done
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.ReadUtils;

public abstract class MarkDuplicatesSparkData {
I still want a better name for this. It's super vague and confusing. I don't know what to call it though. Also, needs some javadoc here.
done
}

public abstract Type getType();
public abstract int getScore();
this can be pushed down to PairedEnds
done

final List<IndexPair<GATKRead>> primaryReads = Utils.stream(keyedRead._2())
////// Making The Fragments //////
// Make a PairedEnd object with no second read for each fragment (and an empty one for each paired read)
We should make our language consistent.
"Make a Fragment for each read which has no mapped mate, and a placeholder for each that does."
I still think we should update this comment
.partitionBy(new KnownIndexPartitioner(reads.getNumPartitions()))
.values();

return reads.zipPartitions(repartitionedReadNames, (readsIter, readNamesIter) -> {
I think we have a major bug here: if we have a non-queryname / query-group sorted bam, the zipPartitions will fail completely when there are multiple shards, because we don't re-sort into matching partitions the second time around. We can fix that by pulling the step that prepares the reads RDD out of transformToDuplicateNames and then reusing the intermediate sorted RDD in the writeout step (probably want to cache it as well).
Right, as we have discussed, the problem here is that it fails to route the duplicate-marking data to the correct partition for all reads in a group if they are not in the same place to start out. So you are liable to incorrectly mark reads as duplicates some of the time. I'm going to elect to fix this by keeping a list of indexes of interest in the MarkDuplicatesSparkData object and making sure the data gets duplicated properly during the zip, as opposed to sorting the bam before the mark phase, as that is likely to have a significant performance impact.
I opened #4701 for this issue, to expedite this branch.
👍
JavaPairRDD<String, GATKRead> keyReadPairs = reads.mapToPair(read -> new Tuple2<>(ReadsKey.keyForRead(header, read), read));
keyedReads = keyReadPairs.groupByKey(numReducers);
}
static JavaPairRDD<IndexPair<String>, Integer> transformToDuplicateNames(final SAMFileHeader header, final MarkDuplicatesScoringStrategy scoringStrategy, final OpticalDuplicateFinder finder, final JavaRDD<GATKRead> reads, final int numReducers) {
We have a bug that will break our assumptions here, I think. If we have a bam that is query-grouped but not query-sorted, which is the usual case, we won't do the adjustment for pairs at the edge of shards. We can fix that with a simple change to the check in ReadsSparkSource.putPairsInSamePartition
done

// Mark duplicates cant properly handle templates with more than two reads in a pair
if (primaryReads.size()>2) {
    throw new GATKException(String.format("Readgroup containing read %s has more than two primary reads, this is not valid", primaryReads.get(0).getValue()));
This should probably be a UserException.UnsupportedFeature instead of a GATKException. It's possible in a valid bam; it's just not something we allow. The message is also a bit confusing: read group is an overloaded term. Maybe "MarkDuplicatesSpark only supports singleton fragments and pairs. We found a group with >2 primary reads (%d reads)." and then list all the reads, one per line.
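A sketch of the reworked check, following the suggestion above (the exception type and message shape come from the comment; the stream join for listing reads is illustrative):
if (primaryReads.size() > 2) {
    throw new UserException.UnsupportedFeature(String.format(
            "MarkDuplicatesSpark only supports singleton fragments and pairs. " +
                    "We found a group with %d primary reads:%n%s",
            primaryReads.size(),
            primaryReads.stream()
                    .map(pair -> pair.getValue().toString())
                    .collect(Collectors.joining(System.lineSeparator()))));
}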
done
}).groupByKey(numReducers);
});

final JavaPairRDD<Integer, Iterable<MarkDuplicatesSparkData>> keyedPairs = pairedEnds.groupByKey(); //TODO make this a proper aggregate by key
let's get rid of this comment
done
…o first see if it works
@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
this should probably be an advanced argument, or a not-recommended one, or something
I mean, how "advanced" is matching gatk3 vs not matching gatk?
very advanced. We need a "not recommended" argument option
final Map<String,Integer> namesOfNonDuplicateReadsAndOpticalCounts = Utils.stream(readNamesIter).collect(Collectors.toMap(Tuple2::_1,Tuple2::_2));
return Utils.stream(readsIter).peek(read -> {
// Handle reads that have been marked as non-duplicates (which also get tagged with optical duplicate summary statistics)
if( namesOfNonDuplicateReadsAndOpticalCounts.containsKey(read.getName())) { //todo figure out if we should be marking the unmapped mates of duplicate reads as duplicates
extraneous comment here
done

import javax.validation.constraints.Max;
this import seems spurious
// Place all the reads into a single RDD of MarkDuplicatesSparkRecord objects
final JavaPairRDD<Integer, MarkDuplicatesSparkRecord> pairedEnds = keyedReads.flatMapToPair(keyedRead -> {
final List<Tuple2<Integer, MarkDuplicatesSparkRecord>> out = Lists.newArrayList();
AtomicReference<IndexPair<GATKRead>> hadNonPrimaryRead = new AtomicReference<>();
it would be good not to use an atomic reference here; they have a lot of overhead
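One way to drop the AtomicReference is to do the scan in a plain loop, where an ordinary local can be reassigned. A sketch of the pattern (the loop body is illustrative, not the tool's exact logic):
// A mutable local works because nothing mutates it from inside a nested lambda;
// the capture-must-be-effectively-final rule only bites within lambdas.
IndexPair<GATKRead> nonPrimaryRead = null;
for (final IndexPair<GATKRead> readWithIndex : keyedRead._2()) {
    final GATKRead read = readWithIndex.getValue();
    if (read.isSecondaryAlignment() || read.isSupplementaryAlignment()) {
        nonPrimaryRead = readWithIndex;
    }
}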
done
}

static JavaPairRDD<String, Iterable<GATKRead>> spanReadsByKey(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
JavaPairRDD<String, GATKRead> nameReadPairs = reads.mapToPair(read -> new Tuple2<>(read.getName(), read));
//todo use this instead of keeping all unmapped reads as non-duplicate
can't we remove this?
}
// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupKey(MarkDuplicatesSparkRecord record) {
return record.getClass()==Passthrough.class?-1:
did you open an issue to resolve the readgroup issue?
yes #4700
// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupKey(MarkDuplicatesSparkRecord record) {
return record.getClass()==Passthrough.class?-1:
(((long)((PairedEnds)record).getUnclippedStartPosition()) << 32 |
return record.getClass()==Passthrough.class?-1:
(((long)((PairedEnds)record).getUnclippedStartPosition()) << 32 |
((PairedEnds)record).getFirstRefIndex() << 16 );
//| ((PairedEnds)pe).getLibraryIndex())).values();
dead code
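For context, an annotated version of the packing being discussed; the bit layout is read off the code above, and the comments are editorial:
// Pack (unclipped start, reference index) into one long so a single groupBy
// covers all merged record types: bits 63-32 hold the start position,
// bits 31-16 the reference index, and the low 16 bits are free (the
// commented-out library index once occupied them). Passthrough records
// all share the -1 sentinel key.
private static long getGroupKey(final MarkDuplicatesSparkRecord record) {
    if (record.getClass() == Passthrough.class) {
        return -1;
    }
    final PairedEnds pe = (PairedEnds) record;
    return (((long) pe.getUnclippedStartPosition()) << 32)
            | (((long) pe.getFirstRefIndex()) << 16);
}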
* with the same name. TODO: explain why there might be more.
* (3) keyMarkDuplicatesSparkRecords with alignment info:
*   (a) Generate a fragment or emptyFragment from each read if its unpaired.
*   (b) Pair grouped reads reads into MarkDuplicatesSparkRecord. In most cases there will only be two reads
duplicate reads
return handleFragments(pairedEnds, scoringStrategy, header).iterator();
}
/**
* Primary landing point for MarkDulicateSparkRecords:
typo MarkDulicate
Minor comments for you @jamesemery
return samRecord.getTransientAttribute(key);
}

public void setTransientAttribute(Object key, Object value) {
Add javadoc for these two new methods, with appropriate disclaimers about use of these methods, and an explanation of what the transient attributes are.
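A sketch of the requested javadoc, assuming these methods simply delegate to htsjdk's SAMRecord transient-attribute map (wording is illustrative):
/**
 * Returns the transient attribute stored under the given key, or null if none is set.
 *
 * Transient attributes live only on the in-memory record: they are never written to
 * SAM/BAM/CRAM output and do not survive serialization, so they are suitable only
 * for ephemeral bookkeeping within a single processing step.
 */
public Object getTransientAttribute(Object key) {
    return samRecord.getTransientAttribute(key);
}

/**
 * Stores a transient (in-memory only, never serialized) attribute under the given key.
 * See {@link #getTransientAttribute(Object)} for the caveats on transient attributes.
 */
public void setTransientAttribute(Object key, Object value) {
    samRecord.setTransientAttribute(key, value);
}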
* @param collection any Collection
* @return true if the collection exists and has elements
*/
public static boolean hasElements(Collection<?> collection){
isNonEmpty() or isNonEmptyCollection() might be better names.
@@ -214,7 +214,7 @@ public boolean accept(Path path) {
* so they are processed together. No shuffle is needed.
*/
JavaRDD<GATKRead> putPairsInSamePartition(final SAMFileHeader header, final JavaRDD<GATKRead> reads) {
-    if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname)) {
+    if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname) && !SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder())) {
Can you add a comment explaining the GroupOrder vs. SortOrder check here?
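Something along these lines would answer it (a sketch; the "return reads unchanged" behavior is a reading of the surrounding method, not quoted from the PR):
// Reads sharing a query name are guaranteed to be adjacent only when the bam is
// queryname-sorted (SortOrder) or query-grouped (GroupOrder). GroupOrder.query
// promises adjacency within a name group without fixing an order between groups,
// which is all this method relies on. For any other layout there is no adjacency
// to preserve across partition boundaries, so the reads are returned unchanged.
if (!header.getSortOrder().equals(SAMFileHeader.SortOrder.queryname)
        && !SAMFileHeader.GroupOrder.query.equals(header.getGroupOrder())) {
    return reads;
}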
@@ -54,13 +54,16 @@
fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME)
protected String output;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Move this argument and duplicates_scoring_strategy into a MarkDuplicatesSparkArgumentCollection.
@@ -112,6 +112,9 @@
@Argument(doc = "the join strategy for reference bases and known variants", fullName = "join-strategy", optional = true)
private JoinStrategy joinStrategy = JoinStrategy.BROADCAST;

@Argument(fullName = MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES, doc = "Enabling this option will mean unmapped mates of duplicate marked reads will not be marked as duplicates.")
public boolean dontMarkUnmappedMates = false;
Move to argument collection, as noted above.
}
}
return reads;
private static Tuple2<IndexPair<String>, Integer> handleFragments(List<MarkDuplicatesSparkRecord> duplicateFragmentGroup) {
All methods in this class should have basic javadoc with at least an explanation of the method's purpose.
private static final long serialVersionUID = 1l;
private final SAMFileHeader header;
// TODO: Unify with other comparators in the codebase
public static final class PairedEndsCoordinateComparator implements Comparator<PairedEnds>, Serializable {
It would probably be better to have this comparator live in a standalone class in the same package instead of embedded in this utils class like this.
* the same location.
*
* Ordering is almost identical to the {@link htsjdk.samtools.SAMRecordCoordinateComparator},
* modulo a few subtle differences in tie-breaking rules for reads that share the same
You say that the ordering is almost identical to the SAMRecordCoordinateComparator, but it looks like you've stripped out most of the tie-breaking.
JavaPairRDD<Integer, GATKRead> firstKeyed = firstReads.mapToPair(read -> new Tuple2<>(ReadsKey.hashKeyForFragment(
        ReadUtils.getStrandedUnclippedStart(
                read),
        read.isReverseStrand(),
Strange formatting here...
int key = library != null ? library.hashCode() : 1;
key = key * 31 + referenceIndex;
key = key * 31 + strandedUnclippedStart;
return key * 31 + (reverseStrand ? 0 : 1);
Add comment explaining why you took this strategy for the key generation
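A candidate comment; the rationale is the standard one for this hashing idiom, not taken from the PR itself:
// Fold the fields together with the multiply-by-31 rolling hash used by
// String.hashCode / Objects.hash: it is cheap, gives each field a distinct
// positional weight so (library, refIndex, start, strand) tuples rarely
// collide, and spreads consecutive start positions across Spark partitions.
int key = library != null ? library.hashCode() : 1;
key = key * 31 + referenceIndex;
key = key * 31 + strandedUnclippedStart;
return key * 31 + (reverseStrand ? 0 : 1);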
@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
public void testSortOrderParitioningCorrectness() throws IOException {
typo paritioning
@@ -89,4 +98,51 @@ private String getReadGroupId(final SAMFileHeader header, final int index) {
return new Tuple2<>(key, ImmutableList.copyOf(reads));
}

@Test (enabled = false)
let's move this into the other branch
it's in the other branch now
@@ -38,7 +43,7 @@ public void testSpanningIterator() {
ImmutableList.of(pairIterable(1, "a"), pairIterable(2, "b"), pairIterable(1, "c")));
}

-@Test(groups = "spark")
+@Test(groups = "spark",enabled = false) //TODO discuss with reviewer what to do about this test. perhaps the readgroups should still be used in the name?
public void testSpanReadsByKeyWithAlternatingGroups() {
SAMFileHeader header = ArtificialReadUtils.createArtificialSamHeaderWithGroups(1, 1, 1000, 2);
can we change this to be names instead of read groups in the test?
second.isReverseStrand() ? "r" : "f");
key = 31 * key + ReadUtils.getReferenceIndex(second, header);
key = 31 * key + ReadUtils.getStrandedUnclippedStart(second);
return 31 * key + (second.isReverseStrand() ? 0 : 1);
}

/**
* Makes a unique key for the read.
Document the reason why the read group isn't needed here.
private final transient GATKRead read;

Passthrough(GATKRead read, int partitionIndex) {
    super(partitionIndex, read.getName());
Should make the key at construction time instead of holding a transient reference to the read (can do in a separate PR, however).
done
@DefaultSerializer(Pair.Serializer.class)
public final class Pair extends PairedEnds implements OpticalDuplicateFinder.PhysicalLocation {
protected transient GATKRead first;
protected transient GATKRead second;
Can you document why these can be transient (as it's not obvious that this is safe).
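A sketch of the sort of note that could go on these fields; the stated reason is an assumption based on how the custom Serializer in this PR appears to work, not confirmed by the thread:
// These can be transient because a Pair is never serialized field-by-field:
// its custom Serializer writes only the derived values (name, positions,
// orientation, score, physical-location info) captured at construction, so
// the GATKRead objects themselves never need to cross a shuffle boundary.
protected transient GATKRead first;
protected transient GATKRead second;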
* processing on the reads. (eg. unmapped reads we don't want to process but must be non-duplicate marked)
*/
public final class Passthrough extends MarkDuplicatesSparkRecord {
private final transient GATKRead read;
Can you document why these can be transient (as it's not obvious that this is safe).
@@ -121,6 +122,7 @@ public void testReadsPipelineSpark(PipelineTest params) throws IOException {
args.add("-R");
args.add(referenceFile.getAbsolutePath());
}
args.add("--"+ MarkDuplicatesSpark.DO_NOT_MARK_UNMAPPED_MATES);
Do you have tests that cover the case where this argument is not specified?
yes, the corresponding tests (with the same names and arguments in AbstractMarkDuplicatesTester)
@@ -158,4 +158,24 @@ public void testMarkDuplicatesSparkIntegrationTestLocal(
}
}
}
Do you have tests that take name-sorted input? Can you create a ticket to add more tests that use name-sorted input?
in another PR
.filter(indexPair -> !(indexPair.getValue().isSecondaryAlignment()||indexPair.getValue().isSupplementaryAlignment()))
.collect(Collectors.toList());

// Mark duplicates cant properly handle templates with more than two reads in a pair
missing '

import java.util.Random;

public class ReadsKeyUnitTest {
All test classes should extend GATKBaseTest
Co-authored-by: Louis Bergelson <[email protected]> First part of a major rewrite of MarkDuplicatesSpark to improve performance. Tool still has a number of known issues, but is much faster than the previous version.
This PR is the culmination of work by @lbergelson and myself to improve the runtime of MarkDuplicatesSpark on a single machine. It involved a rewrite of the tool as well as a number of improvements that should bring it into closer agreement with Picard's MarkDuplicates.
Note: this is merely a checkpoint; there is still work to be done to bring the tool into agreement with recent MarkDuplicates development in Picard.
Resolves #3706