Too many follow-ons with gbwt pruning with single vcf and many regions #700

Open
glennhickey opened this issue Jan 30, 2019 · 0 comments

Running toil-vg index with something like --vcf my.vcf.gz --fasta_regions makes toil-vg assume that my.vcf.gz should be associated with every fasta region. This is harmless in itself (and often what we want when passing in a whole-genome vcf), but it can cause trouble when --gbwt_prune is used.

Since the pruning has to be done in series to incrementally update the mapping_id, toil-vg makes a follow-on chain whose length is the number of sequences in the fasta. In hs38d1, this chain can be very long, and I've run into a stack overflow in getRootJobs() in Toil's job.py.
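
To illustrate why a long serial chain can be a problem, here is a toy sketch. It is not Toil's code: the `Job` class and both helper functions are hypothetical stand-ins, and I am only assuming that the failing traversal is recursive. A recursive walk needs one stack frame per link in the chain, while an iterative walk does not.

```python
class Job:
    """Hypothetical stand-in for a workflow job; not Toil's Job class."""

    def __init__(self, name):
        self.name = name
        self.follow_on = None  # next job in the serial chain, if any


def build_chain(n):
    """Build a linear follow-on chain of n jobs, one per fasta sequence."""
    head = Job("prune_0")
    tail = head
    for i in range(1, n):
        tail.follow_on = Job(f"prune_{i}")
        tail = tail.follow_on
    return head


def chain_length_recursive(job):
    """Walk the chain recursively: stack depth grows with chain length."""
    if job is None:
        return 0
    return 1 + chain_length_recursive(job.follow_on)


def chain_length_iterative(job):
    """Walk the chain with a loop: stack depth stays constant."""
    count = 0
    while job is not None:
        count += 1
        job = job.follow_on
    return count
```

With a chain of a couple of dozen jobs both versions work. With a chain of several thousand jobs, the recursive version would be expected to exceed Python's default recursion limit, while the iterative one still completes.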

The workaround is to do something like --vcf "$(for i in $(seq 1 22; echo X; echo Y); do echo my.vcf.gz; done)" to apply the input VCF only to the sequences that it covers. But it would be nicer if this were more transparent (perhaps by scanning the VCF first to see which chromosomes it covers, and applying it accordingly).
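
A minimal sketch of what that scan could look like, using only the Python standard library. The function names `vcf_contigs` and `vcfs_for_regions` are made up for illustration and are not part of toil-vg. It reads the first column of every record, so it would be slow on a whole-genome VCF; if a tabix index is available, `tabix -l` would list the sequence names much faster.

```python
import gzip


def vcf_contigs(vcf_path):
    """Return contig names from the VCF records, in order of first appearance."""
    # bgzip output is a valid multi-member gzip stream, so gzip.open can read it.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    contigs = []
    seen = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip header lines ("##..." and "#CHROM ...") and blank lines.
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in seen:
                seen.add(chrom)
                contigs.append(chrom)
    return contigs


def vcfs_for_regions(vcf_path, regions):
    """Map each region to the VCF if the VCF has records on it, else None."""
    covered = set(vcf_contigs(vcf_path))
    return {region: (vcf_path if region in covered else None) for region in regions}
```

Only the regions mapped to a VCF would then need a pruning job, which should keep the follow-on chain as short as the number of chromosomes the VCF actually covers.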
