Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

Closed
moe1619 opened this issue May 15, 2021 · 25 comments

Comments

@moe1619
Copy link

moe1619 commented May 15, 2021

Hello,

I'm having a little trouble running DROP with a few of my samples combined with the external blood group. I know there's an open ticket for splicing (#154) but I think other users have the other analyses working. I seem to get an error in the AberrantSplicing module but I don't think I call it.

snakemake --cores 1 sampleAnnotation
WARNING: 1 files missing in samples annotation. Ignoring...
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP mae, please ensure that it has at least 10 IDs, groups: ['DJ004', 'DJ002', 'DJ003', 'DJ001']
  File "PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

I've attached my sample annotation file and yaml file.

Thank you!
sample_annotation.txt
config.yaml.txt

@nickhsmith
Copy link
Collaborator

Hi, thanks for using DROP. I think there are a couple of issues, which I hope can help you get things working. We are currently working to automatically run the correct snakemake commands based on how the user defines the config, but currently that is not the case. The default is to run all of the modules (aberrant splicing, aberrant expression, and monoallelic expression). So if you want to only run aberrant expression and monoallelic expression you will need to separately run them by running: snakemake aberrantExpression and snakemake mae respectively.

  1. Looking at your config.yaml file, you will need to redefine the aberrantSplicing module. This currently needs to be defined properly (even if the values are dummy/junk values) so that the config and submodule objects are built correctly. You can use the provided template here as a guide (lines 36-50)
  2. the way the MAE module works is that you need to specify an RNA bam file and its corresponding DNA vcf file and DNA ID (the external counts aren't sufficient for this module), looking at your sample_annotation file this is only done on line 2. If you have access to the other RNA bam files (lines 6-130) and their DNA vcf files and DNA_IDs (3-130) you will need to input them on their corresponding lines
  3. if you don't have the RNA bam files and DNA vcf files for the other samples you will need to remove the value mae from the column DROP_GROUP for the samples where you don't have the RNA bams and corresponding DNA vcfs
  4. at the risk of sounding silly, you also need to specify the full path to each of your files. make sure that /PATH/rna_seq_data/... isn't a copy paste problem. Although I imagine this was done purposely for privacy reasons, I jut wanted to make sure as I have made that mistake more than I'd like to admit.

Please let me know if any of that made sense, and better yet if it was able to help you. If not, please let us know what was confusing or what continues to not work the way you would expect. Best of luck!

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Thank you so much for getting back to me. I am focusing just on aberrantExpression but still having some trouble. I addressed your suggestions below. yaml and sample annotation file attached

  1. I have added back dummy a definition for aberrantSplicing
    2+3. I have removed the MAE group.
  2. PATH is for privacy, but I certainly appreciate the the concern! But I have double checked all of the real paths and they are correct.

I am still having the following problem...

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider2, please ensure that it has at least 10 IDs, groups: []
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

Is it a problem that my files were aligned to ensembl and the control data was aligned to gencode? I can realign if it would fix the problem.

config.yaml.txt
sample_annotation.txt

@vyepez88
Copy link
Collaborator

Hi, can you please rename the groups in the aberrantSplicing and mae dictionaries from outrider2 to outrider? The pipeline is trying to find the outrider2 group among your samples in the sample annotation and is unable to.

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Here is my snakefile in case its helpful. Thanks.

Snakefile (copy).txt

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Same problem...

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider, please ensure that it has at least 10 IDs, groups: ['DJ004', 'DJ002', 'DJ003', 'DJ001']
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

I know its getting passed the submodules/AberrantExpression.py because if I only have 9 samples, I get the following error

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider, please ensure that it has at least 10 IDs, groups: ['DJ001', 'DJ003', 'DJ002', 'DJ004', 'WB100001', 'WB100000', 'WB100003', 'WB100004', 'WB100002']
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 55, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantExpression.py", line 27, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

When I add a 10th row, it errors in the aberrantsplicing module as above

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Just looking in the python code, the aberrant expression module has this line

check number of IDs per group

    all_ids = {g: self.rnaIDs[g] + self.extRnaIDs[g] for g in self.groups}
    self.checkSubset(all_ids)

while the aberrantsplicing module has this
self.name = "AberrantSplicing"
self.rnaIDs = self.sa.subsetGroups(self.groups, assay="RNA")
self.checkSubset(self.rnaIDs)

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

if I comment out line 17 in AberrantSplicing.py "self.checkSubset(self.rnaIDs)", I make it through to the aberrantExpression module. Currently getting some errors there, which I'll investigate. Thanks.

@nickhsmith
Copy link
Collaborator

Thank you for using this, and pushing our code to its limits! This is definitely a big problem, especially if users don't want to run a specific module. We will push forward in fixing this ASAP

In the mean time though, what I've done is remove the minimum required number of samples for each module in a new branch SA_errors

If you pull the changes from there and install them (I just changed line 33 of drop/config/submodules/Submodules.py error=10 -> error=0). Now you shouldn't get any errors from the aberrantSplicing module not having enough samples.

As for your sample annotation and config set up. In the config you should be able to set the aberrantSplicing group to any_junk value, the aberrantExpression group to outrider, and the mae group to mae. And then in the sample_annotation.tsv file set DROP_GROUP column on line 2 to mae,outrider and lines 3-130 to 'outrider'

then when you run snakemake aberrantExpression you will run the group of 129 samples through the aberrantExpression module (using the bam files for lines 2-5 and the external counts for 6-130)

And then snakemake mae you will run the one example of mae you have on line 2 where you specify the matching DNA vcf files in their column.

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Thank you so much. I really appreciate your time.

I apologize, I'm not very savy with Git and I don't want to mess things up since the package seems to be managed by conda.

  1. Could you provide the git command for the pull request? Is it via conda?
  2. I've made it to the aberrantExpression module but I'm getting some errors. It seems to be because I've used one Hg19 reference for my files (ensembl87) while the control files are different (gencode34). I'm happy to re-align my files to gencode34. Could you provide the link to the recommended gencode .fa file?

Thank you!

@nickhsmith
Copy link
Collaborator

No worries.

  1. This should install the new branch. Run it from within your conda env pip install --force-reinstall --no-deps git+https://github.com/gagneurlab/drop.git@SA_errors
  2. For the aberrantExpression module: could you share the explicit errors? It's true that the differences in chromosome stylings are a pain, are all of your external counts of the same style? If they are all the same, then it would probably be easiest to find a matching gtf file (and using that in the config) than to change everything else. But as long as you match them correctly with the GENE_ANNOTATION column from the sample annotation and the key: PATH as defined in the config, you should be fine.

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

The folder /PATH/Output/processed_data/aberrant_expression/ensembl87/ is not created.

This folder /PATH/Output/processed_data/aberrant_expression/gencode34/ holds all of the output including for DJ00* despite the fact that the DJ00* files were created with ensembl87. I believe I've included all of the correct keys in the yml and sample annotation files but the processed_data step still sends the DJ00* files to the gencode34 folder instead of ensembl87.

I've attached the log file. Thanks!

Thanks

2021-05-17T102801.136706.snakemake.log

@nickhsmith
Copy link
Collaborator

Thanks for dealing with this, within an aberrantExpression group you must use the same gene_annotation file. The problem is that when you try to merge 2 samples together, if there are 2 samples where one is using versionA.gtf and the other is using versionB.gtf it won't work.

I think the simplest would be to change it so that you use the same sample annotation file for each sample.

If however you wanted to treat each sample annotation set as a distinct group, and run the ensembl87 as 1 group, and the gencode34 as another you would need to make those 2 distinct aberrantExpression groups in the config and DROP_GROUP columns.

Unfortunately the sampleParams set up that is trying to keep track of the values in the sample annotation file in order to rerun dependent jobs if the sample annotation was changed in a meaningful way was only designed to be run with a single annotation file. If you do not want to configure your data to use a single gene_annotation file (gencode34 for example) you can probably get around this with this hack: cp -r PATH/Output/processed_data/gencode34/params PATH/Output/processed_data/ensembl87/params

This will create the missing sampleParams files which will then be correctly discovered and used for the downstream steps.

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Thank you for working through this with me.

I'm happy to use the same gene annotation file. I think that means I should re-align my samples to gencode34, correct? I'm happy to do that.

Could you provide the link to the recommended gencode .fa file?

For the moment, while I'm getting started, I intend to use the control dataset that you've suggested. In the future, I plan to have adequate samples sequenced internally. Is there a gene annotation that you'd recommend if you were starting today? Most of my samples will probably be in Hg19/GrCH37.

Thanks!

@nickhsmith
Copy link
Collaborator

nickhsmith commented May 17, 2021

Since the majority of your samples already are counted using gencode (which we use as well). I think all you need to do is for the 4 bam files allow those to be counted using the gencode annotation file.

set your config.yaml file to look like this (containing only 1 gene_annotation value):

geneAnnotation:
  gencode34: PATH_TO_GENCODE34

and in the sample annotation tsv file:
for the first 4 samples remove the value ensembl87 from the GENE_ANNOTATION column so that it's empty

That should work (hopefully)

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Working so far! I'll let you know I did encounter this warning...

[E::idx_find_and_load] Could not retrieve index file for '/PATH/DJ002/Aligned.sorted.bam'
samtools idxstats: fail to load index for "/PATH/DJ002/Aligned.sorted.bam", reverting to slow method
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... 

"/PATH/DJ002/Aligned.sorted.bam.bai" exists in the same folder. Is there a reason its not recognizing the index? Does it matter?

Thanks!

@nickhsmith
Copy link
Collaborator

I'm glad to hear it's doing something! Let me know how it goes and how it looks at the end, hopefully this works!

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

Unfortunately another error! I've attached the error message...

Maybe its still a problem that the external matrix was done using gencode while my files were aligned using ensembl?

Thanks!

log.txt

@moe1619
Copy link
Author

moe1619 commented May 17, 2021

I re-aligned my samples to gencode v34 and still get the above error. /PATH/.snakemake/scripts is empty.

I may have found part of the error. I looked at the count matrices for my files (DJ00*) and compared them to the external count matrices.

My files have 62492 transcripts. The external count matrix has 62456 transcripts. I've attached a list of the transcripts in my data that don't exist in the external counts. I haven't looked at every single one but they seem to be transcripts for mitochondrial genes.

Is there anyway to only analyze the intersection of the transcripts?
problematic transcripts.txt

Thanks

@nickhsmith
Copy link
Collaborator

nickhsmith commented May 18, 2021

The canonical chromosomes that are used in the aberrantExpression pipeline during the counting step are chromosomes 1-22, X, Y, M. I'm curious if the external counts (the WB samples) were built using a different canonical set (excluding genes on the chrM/MT)

That could explain the difference. If that is the case, I would try to either get the external counts with the mitochondria genes. OR remove the mitochondria genes from the RNA bam files (DJ00*). You can do that by running the following (essentially only keep elements that do not align to chrM or MT):

samtools view -h INPUT_BAM | awk '{if($3 != "chrM" && $3 != "MT"){print $0}}' | samtools view -Shb - > OUTPUT_FILTERED_BAM

Then you should be able to use the OUTPUT_FILTERED_BAM in your sample annotation for the DJ00* samples. So that when you are counting you won't have any reads that align to the transcripts on the mitochondrial chromosome, and when you merge you will be working with the same set of transcripts.

**On second thought. The way the merging is done, the data frame used internally takes all of the transcripts from the gtf file and then starts counting whether or not a read falls within the transcript bounds.

So this means that the external counts must have been assembled using a different gtf file version than the one provided in the config, it seems like the gtf for the external counts has removed the chromosome MT from gencode.v38lift37.annotation.gtf

The hot fix is to remove the chromosome MT from the gtf file, but we are in contact with the provider of the external counts for a more permanent solution.

@moe1619
Copy link
Author

moe1619 commented May 18, 2021

Thanks again for working through this with me. I've got some steps forward and steps back...

  1. Backwards: In the course of debugging the transcript issue, I realized that the counts for my BAMS were all 0s. When I look at the BAMS in IGV, they're fine so I'm not sure why. The call to summarizeOverlaps in countReads.R returns 0 counts for all genes. I'm pretty sure I've got the correct GTF for alignment so I'm not sure why it doesn't work.
  2. Forwards: I aligned my 4 BAMS again with STAR but this time, I had STAR output gene counts. I've merged these 4 gene count matrices with the external matrix, keeping only transcripts common to both. I had to change the error threshold to 0 from 1 on line 93 of MonoallelicExpression.py to get past this check. This is currently running! (fingers crossed)
  3. Regarding your comment above about the GTF used for the alignment, the website for the external counts indicates that it's gencode v34, not v38. I believe you wrote the name of the v38.gtf. Does that matter?

Thanks!

@vyepez88
Copy link
Collaborator

Hi, great to see that it's working with the STAR aligned files.
Yes, Nick meant gencode v34, not v38. Keep us updated!

@nickhsmith
Copy link
Collaborator

nickhsmith commented May 18, 2021

I was actually looking at the gtf you provided in your config file. You wrote in your original config.yaml file:

geneAnnotation:
  gencode34: /PATH/reference_genomes/hg37_gencode/gencode.v38lift37.annotation.gtf

If the external counts were gencode v34, and you have gencode v38 that could certainly create some confusion. But maybe you've changed the config since you posted the example.

I'm glad to hear things are running, let us know how things go.

@moe1619
Copy link
Author

moe1619 commented May 18, 2021 via email

@moe1619
Copy link
Author

moe1619 commented May 18, 2021

OK! I made it through! Thank you for all of your help! A few outstanding issues but I think we should close this issue because I'm not sure I'll have time to follow up in the near future.

  • I never got the analysis working with BAM files. I had to use the gene count matrix from STAR. I have to explore why summarizeOverlaps return 0. There's probably a mistake in my alignment.
  • I had to change the error threshold to 0 from 1 on line 93 of MonoallelicExpression.py to get past this check. I'm not sure why the MAE pipeline is called.

Ok, thank you! I'm sure I'll have more questions in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants