Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preprocessing step for RSEM #7752

Merged
merged 7 commits into from
Apr 8, 2022
Merged

Preprocessing step for RSEM #7752

merged 7 commits into from
Apr 8, 2022

Conversation

takutosato
Copy link
Contributor

This tool will be used in the neovax project as a pre-processing step before running RSEM, the gene quantification tool, which has stringent requirements for the format of the input bam.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato I made probably extremely overly nitpicky comments on this. It's unclear to me if this is supposed to be a general user tool or a specific wierd one off for an internal pipeline. If that's the case you could probably just tag it as @Experimental or something and ignore most of my comments.

This makes me want to introduce a ReadNameGroupWalker that does the grouping in the background. SortedPairWalker? I'm not sure what I'd call it... Ted wrote a much more exciting PairWalker class but it doesn't deal with secondary reads in the way you need.

*
* ### Task 2. Removing Reads ###
*
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this set of comments. This says it can remove them and then caveat says it doesn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

contradiction removed.

*
*
* Caveat: This tool does not remove duplicate reads; it assumes it's been removed upstream e.g.
* MarkDuplicates with R
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark duplicates with R?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened was I left IntelliJ to look up what the exact argument name was (it's REMOVE_DUPLICATES) and then got distracted and went on to other places in the code.

*
* ### Task 2. Removing Reads ###
*
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done by MarkDuplicates itself, (or a read filter) unless I misunderstand what you're doing. If the reads are aligned can't you use -XL to screen out the midochondrial reads? It's fine to bake these things into the tool to make it foolproof but I'm not clear why it's necessary to do specially.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the hardcoded list of mitochondrial transcripts below it seems like it would be better to pass in an exclusion file with -XL to remove mitocondria instead of hardcoding the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed hardcoded MT contigs and created an interval list.

@CommandLineProgramProperties(
summary = "",
oneLineSummary = "",
programGroup = ReadDataManipulationProgramGroup.class // Sato: Change to QC when the other PR is merged.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented in the other pr but this seems better to me than QC? This isn't really checking anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping ReadDataManipulation

@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME)
public File outSam;

@Argument(fullName = "keep-MT-reads")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lowercase all our full names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to know, this option is now removed

r.getStart() == read1.getMateStart() && r.getMateStart() == read1.getStart()).findFirst();
if (read2.isPresent()){
result.add(new ImmutablePair<>(read1, read2.get()));
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to also warn on the case where the are multiple matchess?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

// Supplementary reads are not handled i.e. removed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a default read filter to remove supplementary reads up front. You can do that by overriding getDefaultReadFilters()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

throw new UserException("Read names do not match: " + this.queryName + " vs " + read.getName());
}

if (isPrimaryAlignment(read) && read.isFirstOfPair()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these throw if first of pair is already set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup

// If either of the pair is unmapped, throw out.
// With the STAR argument --quantTranscriptomeBan IndelSoftclipSingleend this should not occur,
// but we check just to be thorough.
if (read1.getContig() == null || read2.getContig() == null){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If either read1 or read2 is missing this will crash with a NPE. That's probably something that should be handled since it tends to happen. A crash is a fine answer but detect it and write a sane error message.

Since it seems like you do require both a first and second in pair, one way to simplify some things would be to push these checks into ReadFilters and then at this point you'll just check if there is a full set of reads. It would change the counting of the elimination reason though from being per fragment group to being by read which might not be what you want.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to happen. But I'm going to invoke the @droazen rule and do the pragmatic thing, which is that if either read1 or read2 is null, we return false (so these reads will not be output).

final List<CigarElement> cigarElements1 = read1.getCigar().getCigarElements();
final List<CigarElement> cigarElements2 = read2.getCigar().getCigarElements();

if (cigarElements1.size() != 1 || cigarElements2.size() != 1){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might pull out a little function and call it twice instead of duplicating the checks.

@takutosato
Copy link
Contributor Author

@lbergelson thanks again for reviewing, back to you.

@gatk-bot
Copy link

gatk-bot commented Apr 7, 2022

Travis reported job failures from build 38623
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 38623.12 logs
integration openjdk8 38623.2 logs


@Override
public List<ReadFilter> getDefaultReadFilters() {
return Collections.singletonList(ReadFilterLibrary.NOT_SUPPLEMENTARY_ALIGNMENT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may or may not want to include WELLFORMED

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea should be ok to add it but I'm a bit concerned about it having unintended consequences (e.g. read1 is tossed but it's mate isn't), so I want to leave it out for now.

"The primary firstOfPair is already set. Added read = " + read.getName());
this.firstOfPair = read;
} else if (isPrimaryAlignment(read) && read.isSecondOfPair()) {
this.secondOfPair = read;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, do you also want to throw if second of pair is alreayd set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato A few minor comments. Looks good though. I think it looks a lot cleaner than before. Feel free to merge when ready.

@takutosato takutosato merged commit 3b0bc03 into master Apr 8, 2022
@takutosato takutosato deleted the ts_rsem branch April 8, 2022 18:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants