Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing step for RSEM #7752

Merged
merged 7 commits into from
Apr 8, 2022
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,257 @@
package org.broadinstitute.hellbender.tools.walkers.qc;

import htsjdk.samtools.CigarElement;
import htsjdk.samtools.CigarOperator;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.util.PeekableIterator;
import org.apache.commons.lang3.tuple.ImmutablePair;
import org.apache.commons.lang3.tuple.Pair;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.*;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.utils.read.GATKRead;
import org.broadinstitute.hellbender.utils.read.SAMFileGATKReadWriter;
import picard.cmdline.programgroups.ReadDataManipulationProgramGroup;

import java.io.File;
import java.util.*;
import java.util.stream.Collectors;

/**
* Performs post-processing steps to get a bam aligned to a transcriptome ready for RSEM (https://github.com/deweylab/RSEM)
*
* ### Task 1. Reordering Reads ###
*
* Suppose the queryname "Q1" aligns to multiple loci in the transcriptome.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally it's clearly to refer to it as read name but maybe others disagree with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I like read name better

* STAR aligner outputs the reads in the following order:
*
* Q1: Read1 (Chr 1:1000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just be clear, you want the reads sorted first by readname and then subsorted with position as the first tiebreaker instead of first in pair/second of pair?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misread this the first time through because I interpreted Read1 as the name of a read. I might clarify it somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed Read1 to First-of-pair

* Q1: Read2 (Chr 1:2000)
* Q1: Read1 (Chr 20:5000)
* Q1: Read2 (Chr 20:6000)
*
* This is the format required by RSEM. After query-name sorting the reads for duplicate marking,
* the reads will be ordered as follows;
*
* Q1: Read1 (Chr 1:1000)
* Q1: Read1 (Chr 20:5000)
* Q1: Read2 (Chr 1:2000)
* Q1: Read2 (Chr 20:6000)
*
* That is, all the read1's come first, then read2's.
*
* This tool reorders such that the alignment pair appears together as in the first example.
*
* ### Task 2. Removing Reads ###
*
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this set of comments. This says it can remove them and then caveat says it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contradiction removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done by MarkDuplicates itself, (or a read filter) unless I misunderstand what you're doing. If the reads are aligned can't you use -XL to screen out the midochondrial reads? It's fine to bake these things into the tool to make it foolproof but I'm not clear why it's necessary to do specially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the hardcoded list of mitochondrial transcripts below it seems like it would be better to pass in an exclusion file with -XL to remove mitocondria instead of hardcoding the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed hardcoded MT contigs and created an interval list.

*
*
* Caveat: This tool does not remove duplicate reads; it assumes it's been removed upstream e.g.
* MarkDuplicates with R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark duplicates with R?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened was I left IntelliJ to look up what the exact argument name was (it's REMOVE_DUPLICATES) and then got distracted and went on to other places in the code.

*
*/
@CommandLineProgramProperties(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs summaries, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

summary = "",
oneLineSummary = "",
programGroup = ReadDataManipulationProgramGroup.class // Sato: Change to QC when the other PR is merged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in the other pr but this seems better to me than QC? This isn't really checking anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping ReadDataManipulation

)
public class PostProcessReadsForRSEM extends GATKTool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have @DocumentedFeature and ideally you would add the appropriate @WorkflowProperties tags in order to support auto WDL gen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for beta and experimental, those are sort of mutually exclusive. Beta means "this is a first of something that's going to become a real tool" while experimental is "this thing is weird and may never be supported"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tool needs to override requiresReads() and have it return true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME)
public File outSam;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer GATKPath unless there's a strong reason not to

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Argument(fullName = "keep-MT-reads")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lowercase all our full names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know, this option is now removed

public boolean keepMTReads = false;

PeekableIterator<GATKRead> read1Iterator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be moved into traverse and initialized there. It might simplify the special case handling of the first pair as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved into traverse, still need to initialize the first pair

SAMFileGATKReadWriter writer;

ReadPair currentReadPair;

// Use CountingFilter or another existing framework for counting and printing filtered reads
int totalOutputPair = 0;
int notBothMapped = 0;
int chimera = 0;
int unsupportedCigar = 0;
int duplicates = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

int mitochondria = 0;

@Override
public void onTraversalStart(){
read1Iterator = new PeekableIterator<>(directlyAccessEngineReadsDataSource().iterator());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a good idea. If you do this you will ignore any filters / transformers. You probably mean to call getTransformedReadStream(getReadFilter()).iterator() instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice I did not know this was the recommended way but now I do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this need to be Peekable if you never peek it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch I must've stopped peeking at some point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is confusing since there is no read2Iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad a result of copy+paste


if (directlyAccessEngineReadsDataSource().getHeader().getSortOrder() != SAMFileHeader.SortOrder.queryname){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use getHeaderForReads()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

throw new UserException("Input must be query-name sorted.");
}

writer = createSAMWriter(new GATKPath(outSam.getAbsolutePath()), true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just make the GATKPath the argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup


if (!read1Iterator.hasNext()){
throw new UserException("Input has no reads");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that you want to explode on an empty input? It definitely might be the right choice but often people want to be able to silently process empty files through the pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call I like it that way, fixed.

}

currentReadPair = new ReadPair(read1Iterator.next());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also be made local to traverse and the last one is just handled out of the loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point

totalOutputPair++;
}

@Override
public void traverse() {
while (read1Iterator.hasNext()){
final GATKRead read = read1Iterator.next();
if (!currentReadPair.getQueryName().equals(read.getName())){
// We have gathered all the reads with the same query name (primary, secondary, and supplementary alignments)
// Write the reads to output file, reordering the reads as needed.
writeReads(currentReadPair);
currentReadPair = new ReadPair(read);
} else {
currentReadPair.add(read);
}
progressMeter.update(read);
}
}

public boolean passesRSEMFilter(final GATKRead read1, final GATKRead read2) {
// If either of the pair is unmapped, throw out.
// With the STAR argument --quantTranscriptomeBan IndelSoftclipSingleend this should not occur,
// but we check just to be thorough.
if (read1.getContig() == null || read2.getContig() == null){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might be slightly cleaner to check !read1.isUnmapped()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If either read1 or read2 is missing this will crash with a NPE. That's probably something that should be handled since it tends to happen. A crash is a fine answer but detect it and write a sane error message.

Since it seems like you do require both a first and second in pair, one way to simplify some things would be to push these checks into ReadFilters and then at this point you'll just check if there is a full set of reads. It would change the counting of the elimination reason though from being per fragment group to being by read which might not be what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to happen. But I'm going to invoke the @droazen rule and do the pragmatic thing, which is that if either read1 or read2 is null, we return false (so these reads will not be output).

notBothMapped++;
return false;
}

// Chimeric reads are not allowed
if (!read1.getContig().equals(read2.getContig())){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another minor thing, you could use !read1.contigsMatch(read2). Is it the case that STAR will always put the two primary ones on the same transcript before it starts generating secondary reads which are then put together on another transcipt? Or will it generate mixed pairs where Read1Primary and Read2Secondary are on one transcipt and Read1Secondary and Read2Primary are put together on a second transcript.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done—not sure I understand your question. Did I maybe answer it yesterday?

chimera++;
return false;
}

// Cigar must have a single operator M.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indels and clipping are not allowed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea which seems odd but I'm sure they have a good reason

final List<CigarElement> cigarElements1 = read1.getCigar().getCigarElements();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer read1.numCigarElements()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

final List<CigarElement> cigarElements2 = read2.getCigar().getCigarElements();

if (cigarElements1.size() != 1 || cigarElements2.size() != 1){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might pull out a little function and call it twice instead of duplicating the checks.

unsupportedCigar++;
return false;
}

if (cigarElements1.get(0).getOperator() != CigarOperator.M ||
cigarElements2.get(0).getOperator() != CigarOperator.M){
unsupportedCigar++;
return false;
} else {
return true;
}
}

private final static HashSet<String> MT_TRANSCRIPT_IDs = new HashSet<>(Arrays.asList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems very brittle. If this tool is meant to be general purpose there should be a way to supply the transcript names. If it's not meant to be general purpose it should clearly state that and what transcript model it's assuming. (and ideally detect if it's input is a mismatch. Maybe hard... possibly a neat use for a bloom filter.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now doing -XL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably push this up to the top where other variables are assigned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed all the code related to MT handling

"ENST00000387314.1", "ENST00000389680.2", "ENST00000387342.1", "ENST00000387347.2", "ENST00000386347.1",
"ENST00000361390.2", "ENST00000387365.1", "ENST00000387372.1", "ENST00000387377.1", "ENST00000361453.3",
"ENST00000387382.1", "ENST00000387392.1", "ENST00000387400.1", "ENST00000387405.1", "ENST00000387409.1",
"ENST00000361624.2", "ENST00000387416.2", "ENST00000387419.1", "ENST00000361739.1", "ENST00000387421.1",
"ENST00000361851.1", "ENST00000361899.2", "ENST00000362079.2", "ENST00000387429.1", "ENST00000361227.2",
"ENST00000387439.1", "ENST00000361335.1", "ENST00000361381.2", "ENST00000387441.1", "ENST00000387449.1",
"ENST00000387456.1", "ENST00000361567.2", "ENST00000361681.2", "ENST00000387459.1", "ENST00000361789.2",
"ENST00000387460.2", "ENST00000387461.2"
)); // ENST00000389680.2 overlaps with the twist targets
private boolean mitochondrialRead(final GATKRead read){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename isMitochondrialRead

return MT_TRANSCRIPT_IDs.contains(read.getContig());
}

private boolean passesAdditionalFilters(final GATKRead r1, final GATKRead r2){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusingly named and the comment seems out of date. It's weird that it has side effects too. It's just counting, so not a big deal, just weird.

if (!keepMTReads){
// If the read pair is duplicate, then return false
if (mitochondrialRead(r1) || mitochondrialRead(r2)){
mitochondria++;
return false;
}
}

return true;
}

/** Write reads in this order
* 1. Primary. First of pair, second of pair
* 2. For each secondary. First of pair, second of pair.
* **/
private void writeReads(final ReadPair readPair){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that secondary reads are always generated in pairs? I thought it was common for a secondary read to be mated to the primary one. Maybe STAR enforces that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STAR does enforce that the secondary reads are always generated in pairs.

// Update read by side effect
final GATKRead firstOfPair = readPair.getFirstOfPair();
final GATKRead secondOfPair = readPair.getSecondOfPair();

// Write Primary Reads. If either fails, we discard both, and we don't bother with the secondary alignments.
if (passesRSEMFilter(firstOfPair, secondOfPair) && passesAdditionalFilters(firstOfPair, secondOfPair)){
writer.addRead(firstOfPair);
writer.addRead(secondOfPair);
totalOutputPair += 1;

// Now handle secondary alignments.
final List<Pair<GATKRead, GATKRead>> secondaryAlignmentPairs = groupSecondaryReads(readPair.getSecondaryAlignments());
for (Pair<GATKRead, GATKRead> mates : secondaryAlignmentPairs){
// The pair is either both written or both not written.
if (passesRSEMFilter(mates.getLeft(), mates.getRight()) &&
passesAdditionalFilters(mates.getLeft(), mates.getRight())) {
writer.addRead(mates.getLeft());
writer.addRead(mates.getRight());
totalOutputPair += 1;

}
}

// Supplementary reads are not handled i.e. removed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a default read filter to remove supplementary reads up front. You can do that by overriding getDefaultReadFilters()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

/**
*
* Reorder the secondary alignments as described above.
* @param secondaryAlignments may be empty.
* @return a list of pair of matching reads (i.e. read1 with the corresponding read2)
*/
private List<Pair<GATKRead, GATKRead>> groupSecondaryReads(final List<GATKRead> secondaryAlignments){
if (secondaryAlignments.isEmpty()){
return Collections.emptyList();
}

final boolean isRead1 = true;
final Map<Boolean, List<GATKRead>> groupdByRead1 =
secondaryAlignments.stream().collect(Collectors.groupingBy(r -> r.isFirstOfPair()));
final List<GATKRead> read1Reads = groupdByRead1.get(isRead1);
final List<GATKRead> read2Reads = groupdByRead1.get(!isRead1);
if(read1Reads.size() != read2Reads.size()){
logger.warn("Num read1s != num read2s among the secondary alignments; " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would include the number of each here.

secondaryAlignments.get(0).getName());
return Collections.emptyList();
}


final List<Pair<GATKRead, GATKRead>> result = new ArrayList<>(read1Reads.size());
for (GATKRead read1 : read1Reads){
final Optional<GATKRead> read2 = read2Reads.stream().filter(r ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but I like to format stream operations with 1 operation per line,

            final Optional<GATKRead> read2 = read2Reads.stream()
                    .filter(r -> r.getStart() == read1.getMateStart() 
                            && r.getMateStart() == read1.getStart())
                    .findFirst();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, done

r.getStart() == read1.getMateStart() && r.getMateStart() == read1.getStart()).findFirst();
if (read2.isPresent()){
result.add(new ImmutablePair<>(read1, read2.get()));
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to also warn on the case where the are multiple matchess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

logger.warn("Mate not found for the secondary alignment " + read1.getName());
}
}
return result;
}

@Override
public Object onTraversalSuccess(){
// Write out the last set of reads
writeReads(currentReadPair);
logger.info("Total read pairs output: " + totalOutputPair);
logger.info("Read pairs filtered due to unmapped mate: " + notBothMapped);
logger.info("Read pairs filtered due to chimeric alignment: " + chimera);
logger.info("Read pairs filtered due to a cigar element not supported by RSEM: " + unsupportedCigar);
return "SUCCESS";
}

@Override
public void closeTool(){
writer.close();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a null check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

David's favorite check

}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
package org.broadinstitute.hellbender.tools.walkers.qc;

import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.utils.read.GATKRead;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReadPair {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReadPair might not be ideally named since this is about more than just a pair. ReadsWithSameName would be more correct but awkward. Not sure what the right solution is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True and true…hmmmm let me think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might recommend making this a private subclass or a non-public class and then just ripping out all the supplementary read handling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this class for a different tool that might be merged soon—is it OK if I leave it public?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's fine. We can always refactor once they're both hin.

private GATKRead firstOfPair;
private GATKRead secondOfPair;
private List<GATKRead> secondaryAlignments = new ArrayList<>(10);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 10 a good guess for capacity here? Do we expect a ton of secondary/supplementary reads on transcriptome data. I would either leave it blank (which is effectively the same but is simpler). Set the default to something low like 2. Or lazily initialize the arrays because they will usually not be needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea good point, leaving it blank.

private List<GATKRead> supplementaryAlignments = new ArrayList<>(10); // Finally understand the difference
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a great moment

private String queryName = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

Copy link
Contributor Author

@takutosato takutosato Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made lists final but could make the others final lists and string are now final



public ReadPair() { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused, remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes


public ReadPair(final GATKRead read) {
this.queryName = read.getName();
add(read);
}

public int size(){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size needs javadoc because it's not clear what it's the size of. Someone could reasonably assume it's the territory covered by the reads. Or rename it to numberOfReads() or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

int size = 0;
size = firstOfPair != null ? size + 1 : size;
size = secondOfPair != null ? size + 1 : size;
size += secondaryAlignments.size();
return size += supplementaryAlignments.size(); // what should we do with the supplementary alignments?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be answered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea we should include it

}

public String getQueryName() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getName is probably nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great

return queryName;
}

public GATKRead getFirstOfPair(){
return firstOfPair;
}

public GATKRead getSecondOfPair(){
return secondOfPair;
}

public List<GATKRead> getSecondaryAlignments() {
return secondaryAlignments;
}

public void add(final GATKRead read) {
if (this.queryName == null){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is final and initialized in the constructor you don't have to worry about checking it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

this.queryName = read.getName();
}

if (! this.queryName.equals(read.getName())){
throw new UserException("Read names do not match: " + this.queryName + " vs " + read.getName());
}

if (isPrimaryAlignment(read) && read.isFirstOfPair()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these throw if first of pair is already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

this.firstOfPair = read;
} else if (isPrimaryAlignment(read) && read.isSecondOfPair()) {
this.secondOfPair = read;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, do you also want to throw if second of pair is alreayd set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

} else if (read.isSecondaryAlignment()) {
this.secondaryAlignments.add(read);
} else if (read.isSupplementaryAlignment()) {
this.supplementaryAlignments.add(read);
} else {
throw new UserException("Unknown read type");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible way for this to happen is a primary read that is not first of pair or second of pair which is allowed by the spec although not used in practice by any technology I know about. It would be good to print out the read as part of the message for debugging purposes, or at least the read name so you could find it easily.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

private boolean isPrimaryAlignment(final GATKRead read){
return ! read.isSecondaryAlignment() && ! read.isSupplementaryAlignment();
}

public boolean isDuplicateMarked() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are partially marked groups a problem you run into? I thought that mark duplicates handled that correctly if the inputs are in queryname order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea MarkDuplicates should be consistent here; removed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method isn't used, should it be deleted?

// Doing some investigation
if (firstOfPair.isDuplicate()) {
// Make sure the rest is duplicate-marked
if (!secondOfPair.isDuplicate() || secondaryAlignments.stream().anyMatch(r -> !r.isDuplicate())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we care about supplementary alignment duplicate marking? It's not looked at here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not; I think I used this function in a different branch that might be merged in the future, so I will just leave it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. it's not used by the RSEM processing tool

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right, that makes sense. We can revisit later if necessary.

throw new UserException("First of pair a duplicate but the rest is not" + secondOfPair.getName());
}
} else {
// Make sure the rest is not duplicate-marked
if (secondOfPair.isDuplicate() || secondaryAlignments.stream().anyMatch(r -> r.isDuplicate())) {
throw new UserException("First of pair a not duplicate but the rest is " + secondOfPair.getName());
}
}
return firstOfPair.isDuplicate();
}

public List<GATKRead> getReads(final boolean onlyPrimaryAlignments){
List<GATKRead> reads = Arrays.asList(firstOfPair, secondOfPair);
if (onlyPrimaryAlignments){
return(reads);
} else {
reads.addAll(secondaryAlignments);
reads.addAll(supplementaryAlignments);
return(reads);
}
}
}
Loading