Long run time with *_CNV_CALLS_pre_filtered.bed #35

Open
WeijiaSu opened this issue Mar 7, 2023 · 4 comments

WeijiaSu commented Mar 7, 2023

Hi Jens,
I am using AmpliconSuite on a set of cancer WGS data. I have 36 samples, and most of them finished successfully, but 10 have been running for 5 days and have still not finished. I checked the *_CNV_CALLS_pre_filtered.bed files, and for these samples they have 50-100 entries. I wonder if this is the problem. The README says "if you notice there are > 50 CNV seeds going into AA, there may be something wrong." I assume the bed files are a little large, but is <100 entries still on a reasonable scale?
If this is the issue, do you think it is OK to re-run AA for these 10 samples with each bed file split into two (so that each has <50 entries)?
My command line is:

$AASuite"PrepareAA.py" -s $name -t 32 --cnvkit_dir /anaconda3/bin/cnvkit.py --fastqs $name"_R1.fastq.gz" $name"_R2.fastq.gz" --ref hg38 --cnsize_min 500 --downsample -1 --run_AA --run_AC

I used the same command line for all 36 samples. And all the samples have similar fastq sizes as input.

Thanks for your help.
Weijia

Member

jluebeck commented Mar 7, 2023

Hi Weijia,

The advice about more than 50 entries in the seeds file applies to the AA_CNV_SEEDS.bed file, not to *_CNV_CALLS_pre_filtered.bed.
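To check a seeds file against that 50-entry rule of thumb, a minimal sketch along these lines may help. The function name `count_bed_entries` is hypothetical and not part of AmpliconSuite; it simply counts interval lines and assumes that blank lines and lines starting with `#`, `track`, or `browser` are headers rather than records.

```python
from pathlib import Path


def count_bed_entries(bed_path):
    """Count interval records in a BED file (hypothetical helper, not an AA function)."""
    count = 0
    with Path(bed_path).open() as handle:
        for line in handle:
            stripped = line.strip()
            # Skip blank lines and the header styles BED files may carry.
            if not stripped or stripped.startswith(("#", "track", "browser")):
                continue
            count += 1
    return count
```

Calling `count_bed_entries("sample_AA_CNV_SEEDS.bed")` (the file name here is a placeholder) returns the number of seed intervals, which can then be compared against 50.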

The problem, though, appears to be the --cnsize_min argument. 500 bp is far below the size at which AA functions. At minimum, --cnsize_min 10000 should be used. AA is not designed to detect "eccDNA", or small ecDNAs below 10 kbp. If you are going down to, say, --cnsize_min 5000, I would recommend raising --cngain to 7 or 8. Below that size, however, I cannot give much guidance on how well the tool functions or on how reliable the outputs are.
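The size guidance above can be written down as a small pre-flight check. This is only a sketch of the advice in this comment, not logic taken from AA itself: `review_cnsize_min` is a hypothetical helper, and the 10000 and 5000 cutoffs and the `--cngain` value of 7 are copied from the text above.

```python
def review_cnsize_min(cnsize_min, cngain=None):
    """Return advisory notes for a --cnsize_min / --cngain pair (hypothetical helper)."""
    notes = []
    if cnsize_min >= 10000:
        # At or above the stated minimum: nothing to flag.
        return notes
    notes.append("--cnsize_min is below 10000, the minimum recommended in this thread.")
    if cnsize_min >= 5000:
        # Around 5000 bp the advice is to compensate with a stricter gain cutoff.
        if cngain is None or cngain < 7:
            notes.append("Consider raising --cngain to 7 or 8 at this size.")
    else:
        notes.append("Below about 5000 bp there is no guidance on reliability.")
    return notes
```

For the original command in this issue, `review_cnsize_min(500)` returns two notes: the value is below the recommended minimum, and it falls in the range where no guidance is available.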

Jens

Author

WeijiaSu commented Mar 9, 2023

Thank you, Jens. I followed your suggestion and changed --cnsize_min to 10000, but these samples have now been running for 2 days (from the bam inputs) without any output. Their SEEDS.bed files have 0-15 entries. I am not sure why only these samples are running so much longer than the others.
Also, do you have suggestions about the input fastq size? My input fastq files are 100X coverage (hg38). I am wondering if that coverage is unnecessarily high for detecting ecDNA using AA.
Thanks for your help!

Member

jluebeck commented Mar 9, 2023

Hi Weijia,

Are the samples producing any outputs in log files or stdout? If there is no logging output then there may be a problem.

We've found that using anything above 40-50x coverage has little impact on the ecDNA discovered, and 5-10x is almost always perfectly adequate unless the ecDNA are highly subclonal. AA's threshold for SV detection scales with baseline coverage, so you are likely not missing critical reads by downsampling, since the original coverage would have a higher threshold for discovering the SVs anyway. I recommend that you try leaving the downsample parameter at its default for your long-running samples to see if that helps.
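As a rough illustration of the arithmetic, going from the 100x data in this thread to the 10x range mentioned above means keeping about a tenth of the reads. The helper below is a hypothetical sketch, not an AA or PrepareAA function, and it assumes reads are subsampled uniformly so that coverage scales linearly with the fraction kept.

```python
def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep to reach target_coverage (hypothetical helper)."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverage values must be positive")
    # Never keep more than all reads when the sample is already below target.
    return min(1.0, target_coverage / current_coverage)


print(downsample_fraction(100, 10))  # 0.1
```

A sample that is already below the target simply gets a fraction of 1.0, so nothing is discarded.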

--cnsize_min 50000 may also help you avoid issues. AA may slow down if it hits areas of low complexity or repetitive sequence, and these show up with increasing frequency at smaller cutoffs. 50000 is adequate for nearly all ecDNAs, since AA will, on its own, recruit any smaller segments into the ecDNA structure when it explores the other parts of the genome that the seed region is connected to.
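To preview how a size cutoff would change the number of candidate regions, one could split the intervals of a BED file by length before re-running. The sketch below is an assumption-laden illustration: `split_by_min_size` is a hypothetical helper, it treats the first three columns as chromosome, start, and end, and it does not reproduce whatever filtering AA actually applies internally.

```python
def split_by_min_size(bed_lines, min_size=50000):
    """Split BED intervals into (kept, dropped) by length (hypothetical helper)."""
    kept, dropped = [], []
    for line in bed_lines:
        stripped = line.strip()
        fields = stripped.split()
        # Skip blanks, short lines, and header-style lines.
        if len(fields) < 3 or stripped.startswith(("#", "track", "browser")):
            continue
        start, end = int(fields[1]), int(fields[2])
        interval = (fields[0], start, end)
        if end - start >= min_size:
            kept.append(interval)
        else:
            dropped.append(interval)
    return kept, dropped
```

Passing the lines of a pre-filtered calls file and comparing `len(kept)` at 10000 versus 50000 gives a quick sense of how many small segments the larger cutoff would exclude.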

Leaving the AA parameters as default at first is highly recommended, then you can customize them further if you feel there are things AA is missing.

For maybe 5% of the samples we encounter, runtimes are extreme because of the complex nature of the focal amplification. You may need to wait a week or so for a few "bad" samples to finish.

Author

WeijiaSu commented Mar 9, 2023

That is very reasonable. Thank you for your help!
Weijia

jluebeck added the FAQ label Jul 25, 2023