add check to ignore genome(s) that cannot be downloaded #277
See also the same problem in two other pipelines 😭 - dib-lab/charcoal#235
Apparently the error is solved if you delete the row with the problematic genome from {output_folder}/gather/{sample}.gather.csv.gz. You just need to gunzip it, remove the row, and gzip it again.
Yep. However, that then has the effect of ignoring any other genome (or genomes) that would have been chosen in the absence of the problematic genome. E.g., if there's a specific E. coli genome that is no longer available, by removing it from the gather output you are probably eliminating all the E. coli genomes. The fix I have in mind (but need to find time to implement robustly, and test) would exclude the specific problematic genome from the search, while allowing other related genomes that are NOT problematic to be included. I believe you could mimic that here by removing the problematic genome from the prefetch file, rather than the gather file.
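For what it's worth, a minimal sketch of that gunzip/edit/gzip round trip in Python, applied to the prefetch CSV as suggested. The path, accession, and column name below are placeholders, not grist's actual layout, and should be checked against your own files:

```python
"""Minimal sketch: drop a problematic genome from a gzipped grist CSV."""
import csv
import gzip

csv_path = "outputs/gather/SAMPLE.prefetch.csv.gz"  # placeholder path
bad_ident = "GCA_000123456.1"                       # hypothetical accession
ident_column = "match_name"                         # verify against the CSV header

# Read all rows, keeping only those that do not mention the bad genome.
with gzip.open(csv_path, "rt", newline="") as fp:
    reader = csv.DictReader(fp)
    fieldnames = reader.fieldnames
    rows = [row for row in reader if bad_ident not in row[ident_column]]

# Write the filtered rows back out, re-gzipped, with the same header.
with gzip.open(csv_path, "wt", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```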
I have come up with a solution to ignore the problematic genomes.
Usage:
An advantage is that we do not need to specify the ignore-genomes parameter in the configuration, as it will never run into this problem when downloading them. The process takes a few minutes, since it needs to check each prefetch genome separately. My assumption is that if a genome is present in the prefetch list, there is likely another closely related genome also in the list, so even if we don't get the best match, we will get a decent match. This assumption probably holds better when using the full database rather than a dereplicated one. I have used the same Python libraries that the grist scripts use, so there should not be a major compatibility issue.
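A minimal sketch of how such a per-genome availability check could look. Everything here is an assumption: `url_for_accession()` is a hypothetical helper standing in for however grist resolves an accession to a download URL, and the prefetch path and column name are placeholders:

```python
"""Sketch: flag prefetch genomes whose download URL is unreachable."""
import csv
import gzip
import urllib.error
import urllib.request


def url_for_accession(accession):
    """Hypothetical helper: map an accession to its genome download URL."""
    raise NotImplementedError


def is_downloadable(url, timeout=30):
    """Return True if a HEAD request to the URL succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.HTTPError, urllib.error.URLError):
        return False


prefetch_csv = "outputs/gather/SAMPLE.prefetch.csv.gz"  # placeholder path
ignore_idents = []
with gzip.open(prefetch_csv, "rt", newline="") as fp:
    for row in csv.DictReader(fp):
        ident = row["match_name"].split(" ")[0]  # accession prefix (assumed format)
        if not is_downloadable(url_for_accession(ident)):
            ignore_idents.append(ident)

print("genomes to ignore:", ignore_idents)
```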
Re-opening the issue below as a new issue. I am having the same issue. Help would be greatly appreciated.
These rules correctly ignore the missing genome specified in the yaml:
The first rule that produces an error is extract_leftover_reads_wc. I checked its code, and it seems that it uses the gather_csv file as input, but the Python script substract_gather.py does not check for the flagged genomes.
These other rules also use that CSV as input:
make_gather_notebook_wc -> Uses papermill and report-gather.ipynb
make_mapping_notebook_wc -> Uses papermill and report-mapping.ipynb
A possible solution would be to pass the list of flagged genomes (IGNORE_IDENTS) as an argument to the Python script, so it can skip them when loading the list of genomes from the CSV.
Line 29:
I don't know enough about Python notebooks to suggest a solution there.
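One hedged possibility for the notebooks: papermill can inject values into a cell tagged `parameters`, so the ignore list could be passed in at execution time, assuming report-gather.ipynb has (or could gain) such a cell and filters its gather table against it:

```python
# Sketch only: assumes the notebook defines ignore_idents in a cell tagged
# "parameters" and uses it to filter the loaded gather results.
import papermill as pm

pm.execute_notebook(
    "report-gather.ipynb",
    "outputs/reports/report-gather.ipynb",                 # placeholder output path
    parameters=dict(ignore_idents=["GCA_000123456.1"]),    # hypothetical accession
)
```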
Originally posted by @carden24 in #241 (comment)
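A minimal sketch of the argument-passing idea proposed above, under the assumption (not verified against the actual script) that the genome list is loaded from the gzipped gather CSV and that identifiers appear in a `name` column:

```python
# Sketch: accept IGNORE_IDENTS on the command line and filter rows on load.
# The argument name, column name, and surrounding script are assumptions.
import argparse
import csv
import gzip

parser = argparse.ArgumentParser()
parser.add_argument("gather_csv")
parser.add_argument("--ignore-idents", nargs="*", default=[],
                    help="genome identifiers to skip")
args = parser.parse_args()

with gzip.open(args.gather_csv, "rt", newline="") as fp:
    rows = [
        row for row in csv.DictReader(fp)
        if not any(ident in row["name"] for ident in args.ignore_idents)
    ]

print(f"kept {len(rows)} gather rows after filtering")
```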