Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

moe1619 · 2021-05-15T14:50:47Z

Hello,

I'm having a little trouble running DROP with a few of my samples combined with the external blood group. I know there's an open ticket for splicing (#154) but I think other users have the other analyses working. I seem to get an error in the AberrantSplicing module but I don't think I call it.

snakemake --cores 1 sampleAnnotation
WARNING: 1 files missing in samples annotation. Ignoring...
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP mae, please ensure that it has at least 10 IDs, groups: ['DJ004', 'DJ002', 'DJ003', 'DJ001']
  File "PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

I've attached my sample annotation file and yaml file.

Thank you!
sample_annotation.txt
config.yaml.txt

The text was updated successfully, but these errors were encountered:

nickhsmith · 2021-05-17T12:40:19Z

Hi, thanks for using DROP. I think there are a couple of issues, which I hope can help you get things working. We are currently working to automatically run the correct snakemake commands based on how the user defines the config, but currently that is not the case. The default is to run all of the modules (aberrant splicing, aberrant expression, and monoallelic expression). So if you want to only run aberrant expression and monoallelic expression you will need to separately run them by running: snakemake aberrantExpression and snakemake mae respectively.

Looking at your config.yaml file, you will need to redefine the aberrantSplicing module. This currently needs to be defined properly (even if the values are dummy/junk values) so that the config and submodule objects are built correctly. You can use the provided template here as a guide (lines 36-50)
the way the MAE module works is that you need to specify an RNA bam file and its corresponding DNA vcf file and DNA ID (the external counts aren't sufficient for this module), looking at your sample_annotation file this is only done on line 2. If you have access to the other RNA bam files (lines 6-130) and their DNA vcf files and DNA_IDs (3-130) you will need to input them on their corresponding lines
if you don't have the RNA bam files and DNA vcf files for the other samples you will need to remove the value mae from the column DROP_GROUP for the samples where you don't have the RNA bams and corresponding DNA vcfs
at the risk of sounding silly, you also need to specify the full path to each of your files. make sure that /PATH/rna_seq_data/... isn't a copy paste problem. Although I imagine this was done purposely for privacy reasons, I jut wanted to make sure as I have made that mistake more than I'd like to admit.

Please let me know if any of that made sense, and better yet if it was able to help you. If not, please let us know what was confusing or what continues to not work the way you would expect. Best of luck!

moe1619 · 2021-05-17T13:30:59Z

Thank you so much for getting back to me. I am focusing just on aberrantExpression but still having some trouble. I addressed your suggestions below. yaml and sample annotation file attached

I have added back dummy a definition for aberrantSplicing
2+3. I have removed the MAE group.
PATH is for privacy, but I certainly appreciate the the concern! But I have double checked all of the real paths and they are correct.

I am still having the following problem...

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider2, please ensure that it has at least 10 IDs, groups: []
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

Is it a problem that my files were aligned to ensembl and the control data was aligned to gencode? I can realign if it would fix the problem.

config.yaml.txt
sample_annotation.txt

vyepez88 · 2021-05-17T13:34:02Z

Hi, can you please rename the groups in the aberrantSplicing and mae dictionaries from outrider2 to outrider? The pipeline is trying to find the outrider2 group among your samples in the sample annotation and is unable to.

moe1619 · 2021-05-17T13:34:45Z

Here is my snakefile in case its helpful. Thanks.

Snakefile (copy).txt

moe1619 · 2021-05-17T13:45:18Z

Same problem...

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider, please ensure that it has at least 10 IDs, groups: ['DJ004', 'DJ002', 'DJ003', 'DJ001']
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 63, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantSplicing.py", line 17, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

moe1619 · 2021-05-17T13:52:43Z

I know its getting passed the submodules/AberrantExpression.py because if I only have 9 samples, I get the following error

snakemake aberrantExpression --cores 1
ValueError in line 13 of /PATH/Snakefile:
Too few IDs in DROP_GROUP outrider, please ensure that it has at least 10 IDs, groups: ['DJ001', 'DJ003', 'DJ002', 'DJ004', 'WB100001', 'WB100000', 'WB100003', 'WB100004', 'WB100002']
  File "/PATH/Snakefile", line 13, in <module>
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/DropConfig.py", line 55, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/AberrantExpression.py", line 27, in __init__
  File "/PATH/miniconda3/lib/python3.8/site-packages/drop/config/submodules/Submodules.py", line 45, in checkSubset

When I add a 10th row, it errors in the aberrantsplicing module as above

moe1619 · 2021-05-17T14:04:41Z

Just looking in the python code, the aberrant expression module has this line

check number of IDs per group

    all_ids = {g: self.rnaIDs[g] + self.extRnaIDs[g] for g in self.groups}
    self.checkSubset(all_ids)

while the aberrantsplicing module has this
self.name = "AberrantSplicing"
self.rnaIDs = self.sa.subsetGroups(self.groups, assay="RNA")
self.checkSubset(self.rnaIDs)

moe1619 · 2021-05-17T14:10:35Z

if I comment out line 17 in AberrantSplicing.py "self.checkSubset(self.rnaIDs)", I make it through to the aberrantExpression module. Currently getting some errors there, which I'll investigate. Thanks.

nickhsmith · 2021-05-17T14:25:25Z

Thank you for using this, and pushing our code to its limits! This is definitely a big problem, especially if users don't want to run a specific module. We will push forward in fixing this ASAP

In the mean time though, what I've done is remove the minimum required number of samples for each module in a new branch SA_errors

If you pull the changes from there and install them (I just changed line 33 of drop/config/submodules/Submodules.py error=10 -> error=0). Now you shouldn't get any errors from the aberrantSplicing module not having enough samples.

As for your sample annotation and config set up. In the config you should be able to set the aberrantSplicing group to any_junk value, the aberrantExpression group to outrider, and the mae group to mae. And then in the sample_annotation.tsv file set DROP_GROUP column on line 2 to mae,outrider and lines 3-130 to 'outrider'

then when you run snakemake aberrantExpression you will run the group of 129 samples through the aberrantExpression module (using the bam files for lines 2-5 and the external counts for 6-130)

And then snakemake mae you will run the one example of mae you have on line 2 where you specify the matching DNA vcf files in their column.

moe1619 · 2021-05-17T14:45:49Z

Thank you so much. I really appreciate your time.

I apologize, I'm not very savy with Git and I don't want to mess things up since the package seems to be managed by conda.

Could you provide the git command for the pull request? Is it via conda?
I've made it to the aberrantExpression module but I'm getting some errors. It seems to be because I've used one Hg19 reference for my files (ensembl87) while the control files are different (gencode34). I'm happy to re-align my files to gencode34. Could you provide the link to the recommended gencode .fa file?

Thank you!

nickhsmith · 2021-05-17T14:59:35Z

No worries.

This should install the new branch. Run it from within your conda env pip install --force-reinstall --no-deps git+https://github.com/gagneurlab/drop.git@SA_errors
For the aberrantExpression module: could you share the explicit errors? It's true that the differences in chromosome stylings are a pain, are all of your external counts of the same style? If they are all the same, then it would probably be easiest to find a matching gtf file (and using that in the config) than to change everything else. But as long as you match them correctly with the GENE_ANNOTATION column from the sample annotation and the key: PATH as defined in the config, you should be fine.

moe1619 · 2021-05-17T15:11:19Z

The folder /PATH/Output/processed_data/aberrant_expression/ensembl87/ is not created.

This folder /PATH/Output/processed_data/aberrant_expression/gencode34/ holds all of the output including for DJ00* despite the fact that the DJ00* files were created with ensembl87. I believe I've included all of the correct keys in the yml and sample annotation files but the processed_data step still sends the DJ00* files to the gencode34 folder instead of ensembl87.

I've attached the log file. Thanks!

Thanks

2021-05-17T102801.136706.snakemake.log

nickhsmith · 2021-05-17T16:01:33Z

Thanks for dealing with this, within an aberrantExpression group you must use the same gene_annotation file. The problem is that when you try to merge 2 samples together, if there are 2 samples where one is using versionA.gtf and the other is using versionB.gtf it won't work.

I think the simplest would be to change it so that you use the same sample annotation file for each sample.

If however you wanted to treat each sample annotation set as a distinct group, and run the ensembl87 as 1 group, and the gencode34 as another you would need to make those 2 distinct aberrantExpression groups in the config and DROP_GROUP columns.

Unfortunately the sampleParams set up that is trying to keep track of the values in the sample annotation file in order to rerun dependent jobs if the sample annotation was changed in a meaningful way was only designed to be run with a single annotation file. If you do not want to configure your data to use a single gene_annotation file (gencode34 for example) you can probably get around this with this hack: cp -r PATH/Output/processed_data/gencode34/params PATH/Output/processed_data/ensembl87/params

This will create the missing sampleParams files which will then be correctly discovered and used for the downstream steps.

moe1619 · 2021-05-17T16:09:49Z

Thank you for working through this with me.

I'm happy to use the same gene annotation file. I think that means I should re-align my samples to gencode34, correct? I'm happy to do that.

Could you provide the link to the recommended gencode .fa file?

For the moment, while I'm getting started, I intend to use the control dataset that you've suggested. In the future, I plan to have adequate samples sequenced internally. Is there a gene annotation that you'd recommend if you were starting today? Most of my samples will probably be in Hg19/GrCH37.

Thanks!

nickhsmith · 2021-05-17T16:19:07Z

Since the majority of your samples already are counted using gencode (which we use as well). I think all you need to do is for the 4 bam files allow those to be counted using the gencode annotation file.

set your config.yaml file to look like this (containing only 1 gene_annotation value):

geneAnnotation:
  gencode34: PATH_TO_GENCODE34

and in the sample annotation tsv file:
for the first 4 samples remove the value ensembl87 from the GENE_ANNOTATION column so that it's empty

That should work (hopefully)

moe1619 · 2021-05-17T16:31:24Z

Working so far! I'll let you know I did encounter this warning...

[E::idx_find_and_load] Could not retrieve index file for '/PATH/DJ002/Aligned.sorted.bam'
samtools idxstats: fail to load index for "/PATH/DJ002/Aligned.sorted.bam", reverting to slow method
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ...

"/PATH/DJ002/Aligned.sorted.bam.bai" exists in the same folder. Is there a reason its not recognizing the index? Does it matter?

Thanks!

nickhsmith · 2021-05-17T16:49:19Z

I'm glad to hear it's doing something! Let me know how it goes and how it looks at the end, hopefully this works!

moe1619 · 2021-05-17T17:16:43Z

Unfortunately another error! I've attached the error message...

Maybe its still a problem that the external matrix was done using gencode while my files were aligned using ensembl?

Thanks!

log.txt

moe1619 · 2021-05-17T20:09:26Z

I re-aligned my samples to gencode v34 and still get the above error. /PATH/.snakemake/scripts is empty.

I may have found part of the error. I looked at the count matrices for my files (DJ00*) and compared them to the external count matrices.

My files have 62492 transcripts. The external count matrix has 62456 transcripts. I've attached a list of the transcripts in my data that don't exist in the external counts. I haven't looked at every single one but they seem to be transcripts for mitochondrial genes.

Is there anyway to only analyze the intersection of the transcripts?
problematic transcripts.txt

Thanks

nickhsmith · 2021-05-18T08:54:04Z

The canonical chromosomes that are used in the aberrantExpression pipeline during the counting step are chromosomes 1-22, X, Y, M. I'm curious if the external counts (the WB samples) were built using a different canonical set (excluding genes on the chrM/MT)

That could explain the difference. If that is the case, I would try to either get the external counts with the mitochondria genes. OR remove the mitochondria genes from the RNA bam files (DJ00*). You can do that by running the following (essentially only keep elements that do not align to chrM or MT):

~~samtools view -h INPUT_BAM | awk '{if($3 != "chrM" && $3 != "MT"){print $0}}' | samtools view -Shb - > OUTPUT_FILTERED_BAM~~

Then you should be able to use the OUTPUT_FILTERED_BAM in your sample annotation for the DJ00* samples. So that when you are counting you won't have any reads that align to the transcripts on the mitochondrial chromosome, and when you merge you will be working with the same set of transcripts.

**On second thought. The way the merging is done, the data frame used internally takes all of the transcripts from the gtf file and then starts counting whether or not a read falls within the transcript bounds.

So this means that the external counts must have been assembled using a different gtf file version than the one provided in the config, it seems like the gtf for the external counts has removed the chromosome MT from gencode.v38lift37.annotation.gtf

The hot fix is to remove the chromosome MT from the gtf file, but we are in contact with the provider of the external counts for a more permanent solution.

moe1619 · 2021-05-18T13:06:59Z

Thanks again for working through this with me. I've got some steps forward and steps back...

Backwards: In the course of debugging the transcript issue, I realized that the counts for my BAMS were all 0s. When I look at the BAMS in IGV, they're fine so I'm not sure why. The call to summarizeOverlaps in countReads.R returns 0 counts for all genes. I'm pretty sure I've got the correct GTF for alignment so I'm not sure why it doesn't work.
Forwards: I aligned my 4 BAMS again with STAR but this time, I had STAR output gene counts. I've merged these 4 gene count matrices with the external matrix, keeping only transcripts common to both. I had to change the error threshold to 0 from 1 on line 93 of MonoallelicExpression.py to get past this check. This is currently running! (fingers crossed)
Regarding your comment above about the GTF used for the alignment, the website for the external counts indicates that it's gencode v34, not v38. I believe you wrote the name of the v38.gtf. Does that matter?

Thanks!

vyepez88 · 2021-05-18T13:21:02Z

Hi, great to see that it's working with the STAR aligned files.
Yes, Nick meant gencode v34, not v38. Keep us updated!

nickhsmith · 2021-05-18T14:59:26Z

I was actually looking at the gtf you provided in your config file. You wrote in your original config.yaml file:

geneAnnotation:
  gencode34: /PATH/reference_genomes/hg37_gencode/gencode.v38lift37.annotation.gtf

If the external counts were gencode v34, and you have gencode v38 that could certainly create some confusion. But maybe you've changed the config since you posted the example.

I'm glad to hear things are running, let us know how things go.

moe1619 · 2021-05-18T15:10:56Z

I subsequently changed to v34 after posting the original YAML and unfortunately it didn't help. But thank you for checking. I'll keep you updated. Thanks!

…

On Tue, May 18, 2021 at 10:59 AM nickhsmith ***@***.***> wrote: I was actually looking at the gtf you provided in your config file. You wrote in your config.yaml file: geneAnnotation: gencode34: /PATH/reference_genomes/hg37_gencode/gencode.v38lift37.annotation.gtf If the external counts were gencode v34, and you have gencode v38 that could certainly create some confusion. I'm glad to hear things are running, let us know how things go. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#193 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEGB3YNJNO4PRMWUP7MKQTTOJ6GPANCNFSM4456V5AQ> .

-- ------------------------------------------------- Joshua E. Motelow, MD, PhD Fellow, Pediatric Critical Care Medicine Morgan Stanley Children's Hospital of NewYork-Presbyterian pager: 82848 ------------------------------------------------- The information contained in this message may be privileged and confidential. If you are NOT the intended recipient, please notify the sender immediately and destroy this message. Please be aware that email communication can be intercepted in transmission or misdirected. Your use of email to communicate protected health information to us indicates that you acknowledge and accept the possible risks associated with such communication. Please consider communicating any sensitive information by telephone, fax or mail. If you do not wish to have your information sent by email, please contact the sender immediately. -------------------------------------------------

moe1619 · 2021-05-18T20:14:32Z

OK! I made it through! Thank you for all of your help! A few outstanding issues but I think we should close this issue because I'm not sure I'll have time to follow up in the near future.

I never got the analysis working with BAM files. I had to use the gene count matrix from STAR. I have to explore why summarizeOverlaps return 0. There's probably a mistake in my alignment.
I had to change the error threshold to 0 from 1 on line 93 of MonoallelicExpression.py to get past this check. I'm not sure why the MAE pipeline is called.

Ok, thank you! I'm sure I'll have more questions in the future.

moe1619 closed this as completed May 24, 2021

moe1619 mentioned this issue Jun 17, 2021

minSample enforcement is ignoring external counts #173

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

moe1619 commented May 15, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

vyepez88 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021 •

edited

Loading

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021 •

edited

Loading

moe1619 commented May 17, 2021 •

edited

Loading

nickhsmith commented May 18, 2021 •

edited

Loading

moe1619 commented May 18, 2021

vyepez88 commented May 18, 2021

nickhsmith commented May 18, 2021 •

edited

Loading

moe1619 commented May 18, 2021 via email

moe1619 commented May 18, 2021

Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

Too few IDs in DROP_GROUP when combined with public data resource, mae and aberrantExpression #193

Comments

moe1619 commented May 15, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

vyepez88 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

moe1619 commented May 17, 2021

check number of IDs per group

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021 • edited Loading

moe1619 commented May 17, 2021

nickhsmith commented May 17, 2021

moe1619 commented May 17, 2021 • edited Loading

moe1619 commented May 17, 2021 • edited Loading

nickhsmith commented May 18, 2021 • edited Loading

moe1619 commented May 18, 2021

vyepez88 commented May 18, 2021

nickhsmith commented May 18, 2021 • edited Loading

moe1619 commented May 18, 2021 via email

moe1619 commented May 18, 2021

nickhsmith commented May 17, 2021 •

edited

Loading

moe1619 commented May 17, 2021 •

edited

Loading

moe1619 commented May 17, 2021 •

edited

Loading

nickhsmith commented May 18, 2021 •

edited

Loading

nickhsmith commented May 18, 2021 •

edited

Loading