-
Notifications
You must be signed in to change notification settings - Fork 592
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preprocessing step for RSEM #7752
Changes from 4 commits
1e8b28b
a9a5b62
c5ba782
ea73824
95bd053
a2bb2a6
2f7d12d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,257 @@ | ||
package org.broadinstitute.hellbender.tools.walkers.qc; | ||
|
||
import htsjdk.samtools.CigarElement; | ||
import htsjdk.samtools.CigarOperator; | ||
import htsjdk.samtools.SAMFileHeader; | ||
import htsjdk.samtools.util.PeekableIterator; | ||
import org.apache.commons.lang3.tuple.ImmutablePair; | ||
import org.apache.commons.lang3.tuple.Pair; | ||
import org.broadinstitute.barclay.argparser.Argument; | ||
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties; | ||
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions; | ||
import org.broadinstitute.hellbender.engine.*; | ||
import org.broadinstitute.hellbender.exceptions.UserException; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
import org.broadinstitute.hellbender.utils.read.SAMFileGATKReadWriter; | ||
import picard.cmdline.programgroups.ReadDataManipulationProgramGroup; | ||
|
||
import java.io.File; | ||
import java.util.*; | ||
import java.util.stream.Collectors; | ||
|
||
/** | ||
* Performs post-processing steps to get a bam aligned to a transcriptome ready for RSEM (https://github.com/deweylab/RSEM) | ||
* | ||
* ### Task 1. Reordering Reads ### | ||
* | ||
* Suppose the queryname "Q1" aligns to multiple loci in the transcriptome. | ||
* STAR aligner outputs the reads in the following order: | ||
* | ||
* Q1: Read1 (Chr 1:1000) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just be clear, you want the reads sorted first by readname and then subsorted with position as the first tiebreaker instead of first in pair/second of pair? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I misread this the first time through because I interpreted Read1 as the name of a read. I might clarify it somehow. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Changed Read1 to First-of-pair |
||
* Q1: Read2 (Chr 1:2000) | ||
* Q1: Read1 (Chr 20:5000) | ||
* Q1: Read2 (Chr 20:6000) | ||
* | ||
* This is the format required by RSEM. After query-name sorting the reads for duplicate marking, | ||
* the reads will be ordered as follows; | ||
* | ||
* Q1: Read1 (Chr 1:1000) | ||
* Q1: Read1 (Chr 20:5000) | ||
* Q1: Read2 (Chr 1:2000) | ||
* Q1: Read2 (Chr 20:6000) | ||
* | ||
* That is, all the read1's come first, then read2's. | ||
* | ||
* This tool reorders such that the alignment pair appears together as in the first example. | ||
* | ||
* ### Task 2. Removing Reads ### | ||
* | ||
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't understand this set of comments. This says it can remove them and then caveat says it doesn't. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. contradiction removed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This can be done by MarkDuplicates itself, (or a read filter) unless I misunderstand what you're doing. If the reads are aligned can't you use -XL to screen out the midochondrial reads? It's fine to bake these things into the tool to make it foolproof but I'm not clear why it's necessary to do specially. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Based on the hardcoded list of mitochondrial transcripts below it seems like it would be better to pass in an exclusion file with -XL to remove mitocondria instead of hardcoding the list. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Removed hardcoded MT contigs and created an interval list. |
||
* | ||
* | ||
* Caveat: This tool does not remove duplicate reads; it assumes it's been removed upstream e.g. | ||
* MarkDuplicates with R | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mark duplicates with R? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happened was I left IntelliJ to look up what the exact argument name was (it's REMOVE_DUPLICATES) and then got distracted and went on to other places in the code. |
||
* | ||
*/ | ||
@CommandLineProgramProperties( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs summaries, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
summary = "", | ||
oneLineSummary = "", | ||
programGroup = ReadDataManipulationProgramGroup.class // Sato: Change to QC when the other PR is merged. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I commented in the other pr but this seems better to me than QC? This isn't really checking anything. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Keeping ReadDataManipulation |
||
) | ||
public class PostProcessReadsForRSEM extends GATKTool { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No need for beta and experimental, those are sort of mutually exclusive. Beta means "this is a first of something that's going to become a real tool" while experimental is "this thing is weird and may never be supported" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This tool needs to override There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME) | ||
public File outSam; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. prefer GATKPath unless there's a strong reason not to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
|
||
@Argument(fullName = "keep-MT-reads") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think lowercase all our full names. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. good to know, this option is now removed |
||
public boolean keepMTReads = false; | ||
|
||
PeekableIterator<GATKRead> read1Iterator; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could be moved into traverse and initialized there. It might simplify the special case handling of the first pair as well. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved into traverse, still need to initialize the first pair |
||
SAMFileGATKReadWriter writer; | ||
|
||
ReadPair currentReadPair; | ||
|
||
// Use CountingFilter or another existing framework for counting and printing filtered reads | ||
int totalOutputPair = 0; | ||
int notBothMapped = 0; | ||
int chimera = 0; | ||
int unsupportedCigar = 0; | ||
int duplicates = 0; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this isn't used There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes |
||
int mitochondria = 0; | ||
|
||
@Override | ||
public void onTraversalStart(){ | ||
read1Iterator = new PeekableIterator<>(directlyAccessEngineReadsDataSource().iterator()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not a good idea. If you do this you will ignore any filters / transformers. You probably mean to call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice I did not know this was the recommended way but now I do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why does this need to be Peekable if you never peek it? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah good catch I must've stopped peeking at some point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This name is confusing since there is no read2Iterator. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My bad a result of copy+paste |
||
|
||
if (directlyAccessEngineReadsDataSource().getHeader().getSortOrder() != SAMFileHeader.SortOrder.queryname){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not use getHeaderForReads()? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
throw new UserException("Input must be query-name sorted."); | ||
} | ||
|
||
writer = createSAMWriter(new GATKPath(outSam.getAbsolutePath()), true); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just make the GATKPath the argument. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yup |
||
|
||
if (!read1Iterator.hasNext()){ | ||
throw new UserException("Input has no reads"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are you sure that you want to explode on an empty input? It definitely might be the right choice but often people want to be able to silently process empty files through the pipeline. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good call I like it that way, fixed. |
||
} | ||
|
||
currentReadPair = new ReadPair(read1Iterator.next()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This could also be made local to traverse and the last one is just handled out of the loop. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Great point |
||
totalOutputPair++; | ||
} | ||
|
||
@Override | ||
public void traverse() { | ||
while (read1Iterator.hasNext()){ | ||
final GATKRead read = read1Iterator.next(); | ||
if (!currentReadPair.getQueryName().equals(read.getName())){ | ||
// We have gathered all the reads with the same query name (primary, secondary, and supplementary alignments) | ||
// Write the reads to output file, reordering the reads as needed. | ||
writeReads(currentReadPair); | ||
currentReadPair = new ReadPair(read); | ||
} else { | ||
currentReadPair.add(read); | ||
} | ||
progressMeter.update(read); | ||
} | ||
} | ||
|
||
public boolean passesRSEMFilter(final GATKRead read1, final GATKRead read2) { | ||
// If either of the pair is unmapped, throw out. | ||
// With the STAR argument --quantTranscriptomeBan IndelSoftclipSingleend this should not occur, | ||
// but we check just to be thorough. | ||
if (read1.getContig() == null || read2.getContig() == null){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. might be slightly cleaner to check There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If either read1 or read2 is missing this will crash with a NPE. That's probably something that should be handled since it tends to happen. A crash is a fine answer but detect it and write a sane error message. Since it seems like you do require both a first and second in pair, one way to simplify some things would be to push these checks into ReadFilters and then at this point you'll just check if there is a full set of reads. It would change the counting of the elimination reason though from being per fragment group to being by read which might not be what you want. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This doesn't seem to happen. But I'm going to invoke the @droazen rule and do the pragmatic thing, which is that if either read1 or read2 is null, we return false (so these reads will not be output). |
||
notBothMapped++; | ||
return false; | ||
} | ||
|
||
// Chimeric reads are not allowed | ||
if (!read1.getContig().equals(read2.getContig())){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another minor thing, you could use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done—not sure I understand your question. Did I maybe answer it yesterday? |
||
chimera++; | ||
return false; | ||
} | ||
|
||
// Cigar must have a single operator M. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Indels and clipping are not allowed? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea which seems odd but I'm sure they have a good reason |
||
final List<CigarElement> cigarElements1 = read1.getCigar().getCigarElements(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. prefer There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
final List<CigarElement> cigarElements2 = read2.getCigar().getCigarElements(); | ||
|
||
if (cigarElements1.size() != 1 || cigarElements2.size() != 1){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might pull out a little function and call it twice instead of duplicating the checks. |
||
unsupportedCigar++; | ||
return false; | ||
} | ||
|
||
if (cigarElements1.get(0).getOperator() != CigarOperator.M || | ||
cigarElements2.get(0).getOperator() != CigarOperator.M){ | ||
unsupportedCigar++; | ||
return false; | ||
} else { | ||
return true; | ||
} | ||
} | ||
|
||
private final static HashSet<String> MT_TRANSCRIPT_IDs = new HashSet<>(Arrays.asList( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems very brittle. If this tool is meant to be general purpose there should be a way to supply the transcript names. If it's not meant to be general purpose it should clearly state that and what transcript model it's assuming. (and ideally detect if it's input is a mismatch. Maybe hard... possibly a neat use for a bloom filter.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Now doing -XL There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would probably push this up to the top where other variables are assigned? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. removed all the code related to MT handling |
||
"ENST00000387314.1", "ENST00000389680.2", "ENST00000387342.1", "ENST00000387347.2", "ENST00000386347.1", | ||
"ENST00000361390.2", "ENST00000387365.1", "ENST00000387372.1", "ENST00000387377.1", "ENST00000361453.3", | ||
"ENST00000387382.1", "ENST00000387392.1", "ENST00000387400.1", "ENST00000387405.1", "ENST00000387409.1", | ||
"ENST00000361624.2", "ENST00000387416.2", "ENST00000387419.1", "ENST00000361739.1", "ENST00000387421.1", | ||
"ENST00000361851.1", "ENST00000361899.2", "ENST00000362079.2", "ENST00000387429.1", "ENST00000361227.2", | ||
"ENST00000387439.1", "ENST00000361335.1", "ENST00000361381.2", "ENST00000387441.1", "ENST00000387449.1", | ||
"ENST00000387456.1", "ENST00000361567.2", "ENST00000361681.2", "ENST00000387459.1", "ENST00000361789.2", | ||
"ENST00000387460.2", "ENST00000387461.2" | ||
)); // ENST00000389680.2 overlaps with the twist targets | ||
private boolean mitochondrialRead(final GATKRead read){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. static There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. rename |
||
return MT_TRANSCRIPT_IDs.contains(read.getContig()); | ||
} | ||
|
||
private boolean passesAdditionalFilters(final GATKRead r1, final GATKRead r2){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is confusingly named and the comment seems out of date. It's weird that it has side effects too. It's just counting, so not a big deal, just weird. |
||
if (!keepMTReads){ | ||
// If the read pair is duplicate, then return false | ||
if (mitochondrialRead(r1) || mitochondrialRead(r2)){ | ||
mitochondria++; | ||
return false; | ||
} | ||
} | ||
|
||
return true; | ||
} | ||
|
||
/** Write reads in this order | ||
* 1. Primary. First of pair, second of pair | ||
* 2. For each secondary. First of pair, second of pair. | ||
* **/ | ||
private void writeReads(final ReadPair readPair){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it the case that secondary reads are always generated in pairs? I thought it was common for a secondary read to be mated to the primary one. Maybe STAR enforces that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. STAR does enforce that the secondary reads are always generated in pairs. |
||
// Update read by side effect | ||
final GATKRead firstOfPair = readPair.getFirstOfPair(); | ||
final GATKRead secondOfPair = readPair.getSecondOfPair(); | ||
|
||
// Write Primary Reads. If either fails, we discard both, and we don't bother with the secondary alignments. | ||
if (passesRSEMFilter(firstOfPair, secondOfPair) && passesAdditionalFilters(firstOfPair, secondOfPair)){ | ||
writer.addRead(firstOfPair); | ||
writer.addRead(secondOfPair); | ||
totalOutputPair += 1; | ||
|
||
// Now handle secondary alignments. | ||
final List<Pair<GATKRead, GATKRead>> secondaryAlignmentPairs = groupSecondaryReads(readPair.getSecondaryAlignments()); | ||
for (Pair<GATKRead, GATKRead> mates : secondaryAlignmentPairs){ | ||
// The pair is either both written or both not written. | ||
if (passesRSEMFilter(mates.getLeft(), mates.getRight()) && | ||
passesAdditionalFilters(mates.getLeft(), mates.getRight())) { | ||
writer.addRead(mates.getLeft()); | ||
writer.addRead(mates.getRight()); | ||
totalOutputPair += 1; | ||
|
||
} | ||
} | ||
|
||
// Supplementary reads are not handled i.e. removed | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would add a default read filter to remove supplementary reads up front. You can do that by overriding There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
} | ||
} | ||
|
||
/** | ||
* | ||
* Reorder the secondary alignments as described above. | ||
* @param secondaryAlignments may be empty. | ||
* @return a list of pair of matching reads (i.e. read1 with the corresponding read2) | ||
*/ | ||
private List<Pair<GATKRead, GATKRead>> groupSecondaryReads(final List<GATKRead> secondaryAlignments){ | ||
if (secondaryAlignments.isEmpty()){ | ||
return Collections.emptyList(); | ||
} | ||
|
||
final boolean isRead1 = true; | ||
final Map<Boolean, List<GATKRead>> groupdByRead1 = | ||
secondaryAlignments.stream().collect(Collectors.groupingBy(r -> r.isFirstOfPair())); | ||
final List<GATKRead> read1Reads = groupdByRead1.get(isRead1); | ||
final List<GATKRead> read2Reads = groupdByRead1.get(!isRead1); | ||
if(read1Reads.size() != read2Reads.size()){ | ||
logger.warn("Num read1s != num read2s among the secondary alignments; " + | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would include the number of each here. |
||
secondaryAlignments.get(0).getName()); | ||
return Collections.emptyList(); | ||
} | ||
|
||
|
||
final List<Pair<GATKRead, GATKRead>> result = new ArrayList<>(read1Reads.size()); | ||
for (GATKRead read1 : read1Reads){ | ||
final Optional<GATKRead> read2 = read2Reads.stream().filter(r -> | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nitpick, but I like to format stream operations with 1 operation per line,
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nice, done |
||
r.getStart() == read1.getMateStart() && r.getMateStart() == read1.getStart()).findFirst(); | ||
if (read2.isPresent()){ | ||
result.add(new ImmutablePair<>(read1, read2.get())); | ||
} else { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to also warn on the case where the are multiple matchess? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
logger.warn("Mate not found for the secondary alignment " + read1.getName()); | ||
} | ||
} | ||
return result; | ||
} | ||
|
||
@Override | ||
public Object onTraversalSuccess(){ | ||
// Write out the last set of reads | ||
writeReads(currentReadPair); | ||
logger.info("Total read pairs output: " + totalOutputPair); | ||
logger.info("Read pairs filtered due to unmapped mate: " + notBothMapped); | ||
logger.info("Read pairs filtered due to chimeric alignment: " + chimera); | ||
logger.info("Read pairs filtered due to a cigar element not supported by RSEM: " + unsupportedCigar); | ||
return "SUCCESS"; | ||
} | ||
|
||
@Override | ||
public void closeTool(){ | ||
writer.close(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs a null check. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. David's favorite check |
||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
package org.broadinstitute.hellbender.tools.walkers.qc; | ||
|
||
import org.broadinstitute.hellbender.exceptions.UserException; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
|
||
import java.util.ArrayList; | ||
import java.util.Arrays; | ||
import java.util.List; | ||
|
||
public class ReadPair { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ReadPair might not be ideally named since this is about more than just a pair. ReadsWithSameName would be more correct but awkward. Not sure what the right solution is. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. True and true…hmmmm let me think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs javadoc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might recommend making this a private subclass or a non-public class and then just ripping out all the supplementary read handling. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I use this class for a different tool that might be merged soon—is it OK if I leave it public? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, that's fine. We can always refactor once they're both hin. |
||
private GATKRead firstOfPair; | ||
private GATKRead secondOfPair; | ||
private List<GATKRead> secondaryAlignments = new ArrayList<>(10); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is 10 a good guess for capacity here? Do we expect a ton of secondary/supplementary reads on transcriptome data. I would either leave it blank (which is effectively the same but is simpler). Set the default to something low like 2. Or lazily initialize the arrays because they will usually not be needed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea good point, leaving it blank. |
||
private List<GATKRead> supplementaryAlignments = new ArrayList<>(10); // Finally understand the difference | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note to self? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It was a great moment |
||
private String queryName = null; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. final There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
|
||
|
||
public ReadPair() { } | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. unused, remove? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yes |
||
|
||
public ReadPair(final GATKRead read) { | ||
this.queryName = read.getName(); | ||
add(read); | ||
} | ||
|
||
public int size(){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. size needs javadoc because it's not clear what it's the size of. Someone could reasonably assume it's the territory covered by the reads. Or rename it to numberOfReads() or something. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
int size = 0; | ||
size = firstOfPair != null ? size + 1 : size; | ||
size = secondOfPair != null ? size + 1 : size; | ||
size += secondaryAlignments.size(); | ||
return size += supplementaryAlignments.size(); // what should we do with the supplementary alignments? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should this be answered? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea we should include it |
||
} | ||
|
||
public String getQueryName() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. getName is probably nicer. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. great |
||
return queryName; | ||
} | ||
|
||
public GATKRead getFirstOfPair(){ | ||
return firstOfPair; | ||
} | ||
|
||
public GATKRead getSecondOfPair(){ | ||
return secondOfPair; | ||
} | ||
|
||
public List<GATKRead> getSecondaryAlignments() { | ||
return secondaryAlignments; | ||
} | ||
|
||
public void add(final GATKRead read) { | ||
if (this.queryName == null){ | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is final and initialized in the constructor you don't have to worry about checking it here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
this.queryName = read.getName(); | ||
} | ||
|
||
if (! this.queryName.equals(read.getName())){ | ||
throw new UserException("Read names do not match: " + this.queryName + " vs " + read.getName()); | ||
} | ||
|
||
if (isPrimaryAlignment(read) && read.isFirstOfPair()) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should these throw if first of pair is already set? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup |
||
this.firstOfPair = read; | ||
} else if (isPrimaryAlignment(read) && read.isSecondOfPair()) { | ||
this.secondOfPair = read; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Heh, do you also want to throw if second of pair is alreayd set? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
} else if (read.isSecondaryAlignment()) { | ||
this.secondaryAlignments.add(read); | ||
} else if (read.isSupplementaryAlignment()) { | ||
this.supplementaryAlignments.add(read); | ||
} else { | ||
throw new UserException("Unknown read type"); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One possible way for this to happen is a primary read that is not first of pair or second of pair which is allowed by the spec although not used in practice by any technology I know about. It would be good to print out the read as part of the message for debugging purposes, or at least the read name so you could find it easily. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||
} | ||
} | ||
|
||
private boolean isPrimaryAlignment(final GATKRead read){ | ||
return ! read.isSecondaryAlignment() && ! read.isSupplementaryAlignment(); | ||
} | ||
|
||
public boolean isDuplicateMarked() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are partially marked groups a problem you run into? I thought that mark duplicates handled that correctly if the inputs are in queryname order? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea MarkDuplicates should be consistent here; removed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This method isn't used, should it be deleted? |
||
// Doing some investigation | ||
if (firstOfPair.isDuplicate()) { | ||
// Make sure the rest is duplicate-marked | ||
if (!secondOfPair.isDuplicate() || secondaryAlignments.stream().anyMatch(r -> !r.isDuplicate())) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we care about supplementary alignment duplicate marking? It's not looked at here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably not; I think I used this function in a different branch that might be merged in the future, so I will just leave it here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i.e. it's not used by the RSEM processing tool There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah, right, that makes sense. We can revisit later if necessary. |
||
throw new UserException("First of pair a duplicate but the rest is not" + secondOfPair.getName()); | ||
} | ||
} else { | ||
// Make sure the rest is not duplicate-marked | ||
if (secondOfPair.isDuplicate() || secondaryAlignments.stream().anyMatch(r -> r.isDuplicate())) { | ||
throw new UserException("First of pair a not duplicate but the rest is " + secondOfPair.getName()); | ||
} | ||
} | ||
return firstOfPair.isDuplicate(); | ||
} | ||
|
||
public List<GATKRead> getReads(final boolean onlyPrimaryAlignments){ | ||
List<GATKRead> reads = Arrays.asList(firstOfPair, secondOfPair); | ||
if (onlyPrimaryAlignments){ | ||
return(reads); | ||
} else { | ||
reads.addAll(secondaryAlignments); | ||
reads.addAll(supplementaryAlignments); | ||
return(reads); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally it's clearly to refer to it as read name but maybe others disagree with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I like read name better