Merge v0.7.0 to main #111

Merged
merged 43 commits into main from v0.7.0
Aug 11, 2023
Commits
019a9bf
Merge pull request #102 from rhysnewell/main
rhysnewell Jul 26, 2023
1adb98a
add test for recover_mags_no_singlem workflow
AroneyS Jul 27, 2023
705d970
add refinery max iterations argument
AroneyS Jul 28, 2023
1c74d1d
add test runner on pull_request
AroneyS Jul 28, 2023
de2b358
add bin check to checkm_metabat2 and checkm_semibin
AroneyS Jul 28, 2023
f7cdfb4
skip checkm_{binner} rules when refinery iterations 0
AroneyS Jul 28, 2023
a475904
add note about iterations=0
AroneyS Jul 28, 2023
cb0b0b6
set max iterations type to int
AroneyS Jul 28, 2023
0ff4bc3
add more friendly message when skipping refinery
AroneyS Jul 28, 2023
1a699ac
Merge pull request #103 from AroneyS/add-refinery-max-iterations-arg
AroneyS Jul 28, 2023
f1cd6c5
add refinery_max_iterations to checkm rules
AroneyS Jul 31, 2023
79996a0
update to semibin2
AroneyS Jul 31, 2023
f5424ba
ensure bins are copied when refinement is skipped
AroneyS Jul 31, 2023
43cd0ee
rework skip binners test
AroneyS Jul 31, 2023
08f1def
fix short recovery only test
AroneyS Jul 31, 2023
520dedd
Merge pull request #104 from rhysnewell/update-to-semibin2
rhysnewell Aug 1, 2023
35cfd3f
fix checkm subprocess
AroneyS Aug 1, 2023
5857170
fix test
AroneyS Aug 1, 2023
0749aa4
revert shell=True and fix subprocess call
AroneyS Aug 1, 2023
aa90293
Merge pull request #105 from rhysnewell/fix-checkm-subprocess
rhysnewell Aug 1, 2023
6b1abf7
fix: removal of instances of shell=True
Aug 3, 2023
e748b34
put shell=True back into main process command to debug
Aug 3, 2023
87e75c4
put shell=True back into main process command to debug
Aug 3, 2023
3a05d71
convert to run with no shell=True
Aug 3, 2023
d3bd50c
convert to run with no shell=True
Aug 3, 2023
b63a6b4
fix: initial subprocess command, removed quotes around config.yaml path
Aug 4, 2023
8a50ac0
update complex piped subprocess commands, needs testing
Aug 4, 2023
6d77ed4
replace commands in polish.py, needs testing
Aug 5, 2023
ee12734
all relevant instances of shell=True converted
Aug 5, 2023
87baa7e
fix #108 update concoct recipe to limit sklean version
Aug 8, 2023
54623cd
fix: set TMPDIR env var before subprocess call. remove hardcoded cond…
Aug 8, 2023
6d9befd
fix: tests and graceful exits on singlem. Move finalise stats
Aug 9, 2023
8ab73c7
revert: add conda-prefix to test commands
Aug 9, 2023
a85d53d
fix: typo finalize -> finalise
Aug 9, 2023
5975815
add long-read integration test
AroneyS Aug 9, 2023
24d714e
cleanup pdb call
AroneyS Aug 9, 2023
df84291
fix: polish.py and pipeing commands
rhysnewell Aug 10, 2023
44d7ae8
fix: add bam index creation in map_reads_ref rule
rhysnewell Aug 11, 2023
8347dfd
assembly not expected since it is input
AroneyS Aug 11, 2023
13d98fb
fix: singlem script no longer uses glob
rhysnewell Aug 11, 2023
17f55db
Merge pull request #107 from rhysnewell/sanitise_subprocess_calls
rhysnewell Aug 11, 2023
75703f7
update: v0.7.0
rhysnewell Aug 11, 2023
3061a2e
Merge pull request #110 from rhysnewell/v0.7.0
rhysnewell Aug 11, 2023
4 changes: 3 additions & 1 deletion .github/workflows/test-aviary.yml
@@ -1,5 +1,5 @@
name: Test Aviary with Setup-Miniconda From Marketplace
on: [push]
on: [push, pull_request]

jobs:
miniconda:
@@ -34,3 +34,5 @@ jobs:
run: |
aviary -h
python test/test_assemble.py
python test/test_recover.py
python test/test_run_checkm.py -b
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ binsnek.egg-info
aviary_genome.egg-info

example/
test/data/.conda
2 changes: 1 addition & 1 deletion aviary/__init__.py
@@ -1 +1 @@
__version__ = "0.6.0"
__version__ = "0.7.0"
9 changes: 8 additions & 1 deletion aviary/aviary.py
@@ -288,7 +288,6 @@ def main():
'-r', '--reference-filter', '--reference_filter',
help='Reference filter file to aid in the assembly',
dest="reference_filter",
nargs=1,
default='none'
)

@@ -494,6 +493,14 @@ def main():
default='global'
)

binning_group.add_argument(
'--refinery-max-iterations', '--refinery_max_iterations',
help='Maximum number of iterations for Rosella refinery. Set to 0 to skip refinery.',
dest='refinery_max_iterations',
type=int,
default=5
)

binning_group.add_argument(
'--skip-binners', '--skip_binners', '--skip_binner', '--skip-binner',
help='Optional list of binning algorithms to skip. Can be any combination of: \n'
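The `--refinery-max-iterations` option added above is a plain argparse integer flag. A minimal standalone sketch (option strings, `dest`, `type`, and `default` mirror the diff; the `skip_refinery` variable is hypothetical) shows how a value of 0 can signal that the refinery should be skipped:

```python
import argparse

# Sketch of the flag added in aviary.py; both spellings map to one dest.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--refinery-max-iterations', '--refinery_max_iterations',
    help='Maximum number of iterations for Rosella refinery. Set to 0 to skip refinery.',
    dest='refinery_max_iterations',
    type=int,
    default=5,
)

args = parser.parse_args(['--refinery-max-iterations', '0'])
# 0 signals "skip refinery" (and the checkm_{binner} rules that feed it).
skip_refinery = args.refinery_max_iterations == 0
```

Because `type=int` is set (the subject of one commit in this PR), the parsed value compares equal to the integer `0` rather than the string `'0'`.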
44 changes: 30 additions & 14 deletions aviary/config/config.py
@@ -83,14 +83,23 @@ def get_software_db_path(db_name='CONDA_ENV_PATH', software_flag='--conda-prefix
signal.alarm(120)
os.environ[db_name] = input(f'Input path to directory for {db_name} now:').strip()
try:
subprocess.Popen(
'mkdir -p %s/etc/conda/activate.d/; mkdir -p %s/etc/conda/deactivate.d/; echo "export %s=%s" >> %s/etc/conda/activate.d/aviary.sh; echo "unset %s" >> %s/etc/conda/deactivate.d/aviary.sh; ' %
(os.environ['CONDA_PREFIX'], os.environ['CONDA_PREFIX'], db_name, os.environ[db_name],
os.environ['CONDA_PREFIX'], db_name, os.environ['CONDA_PREFIX']), shell=True).wait()
conda_prefix = os.environ['CONDA_PREFIX']
# make the directory
os.makedirs(f"{conda_prefix}/etc/conda/activate.d/", exist_ok=True)
os.makedirs(f"{conda_prefix}/etc/conda/deactivate.d/", exist_ok=True)
# add the export to the activate script
with open(f"{conda_prefix}/etc/conda/activate.d/aviary.sh", 'a') as f:
f.write(f'export {db_name}={os.environ[db_name]}\n')

# add the unset to the deactivate script
with open(f"{conda_prefix}/etc/conda/deactivate.d/aviary.sh", 'a') as f:
f.write(f'unset {db_name}\n')

except KeyError:
subprocess.Popen(
'echo "export %s=%s" >> ~/.bashrc' %
(db_name, os.environ[db_name]), shell=True).wait()
# put the export in the bashrc
with open(f"{os.environ['HOME']}/.bashrc", 'a') as f:
f.write(f'export {db_name}={os.environ[db_name]}\n')

signal.alarm(0)
print('=' * 100)
# print('Reactivate your aviary conda environment or source ~/.bashrc to suppress this message.'.center(100))
@@ -105,11 +114,18 @@ def get_software_db_path(db_name='CONDA_ENV_PATH', software_flag='--conda-prefix
def set_db_path(path, db_name='CONDA_ENV_PATH'):
os.environ[db_name] = path.strip()
try:
subprocess.Popen(
'mkdir -p %s/etc/conda/activate.d/; mkdir -p %s/etc/conda/deactivate.d/; echo "export %s=%s" >> %s/etc/conda/activate.d/aviary.sh; echo "unset %s" >> %s/etc/conda/deactivate.d/aviary.sh; ' %
(os.environ['CONDA_PREFIX'], os.environ['CONDA_PREFIX'], db_name, os.environ[db_name],
os.environ['CONDA_PREFIX'], db_name, os.environ['CONDA_PREFIX']), shell=True).wait()
conda_prefix = os.environ['CONDA_PREFIX']
# make the directory
os.makedirs(f"{conda_prefix}/etc/conda/activate.d/", exist_ok=True)
os.makedirs(f"{conda_prefix}/etc/conda/deactivate.d/", exist_ok=True)
# add the export to the activate script
with open(f"{conda_prefix}/etc/conda/activate.d/aviary.sh", 'a') as f:
f.write(f'export {db_name}={os.environ[db_name]}\n')

# add the unset to the deactivate script
with open(f"{conda_prefix}/etc/conda/deactivate.d/aviary.sh", 'a') as f:
f.write(f'unset {db_name}\n')
except KeyError:
subprocess.Popen(
'echo "export %s=%s" >> ~/.bashrc' %
(db_name, os.environ[db_name]), shell=True).wait()
# put the export in the bashrc
with open(f"{os.environ['HOME']}/.bashrc", 'a') as f:
f.write(f'export {db_name}={os.environ[db_name]}\n')
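The rewritten `config.py` above opens the `activate.d`/`deactivate.d` scripts (and `~/.bashrc`) in append mode on every call, so re-running setup duplicates the export line. A possible hardening, sketched here with a hypothetical helper `append_line_once` that is not part of the PR:

```python
import os

def append_line_once(script_path: str, line: str) -> None:
    """Append `line` to `script_path` only if it is not already present.

    Hypothetical hardening of the PR's open(..., 'a') calls: appending
    unconditionally would duplicate the export on every re-run of setup.
    """
    existing = set()
    if os.path.exists(script_path):
        with open(script_path) as f:
            existing = {l.rstrip('\n') for l in f}
    if line not in existing:
        with open(script_path, 'a') as f:
            f.write(line + '\n')
```

The same helper would serve both the `activate.d` export and the `deactivate.d` unset, keeping the scripts idempotent across repeated `set_db_path` calls.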
29 changes: 19 additions & 10 deletions aviary/modules/assembly/assembly.smk
@@ -48,7 +48,8 @@ rule map_reads_ref:
reference_filter = config["reference_filter"]
group: 'assembly'
output:
temp("data/raw_mapped_ref.bam")
temp("data/raw_mapped_ref.bam"),
temp("data/raw_mapped_ref.bai")
conda:
"../../envs/coverm.yaml"
benchmark:
@@ -58,7 +59,7 @@
threads:
config["max_threads"]
shell:
"minimap2 -ax {params.mapper} --split-prefix=tmp -t {threads} {input.reference_filter} {input.fastq} | samtools view -@ {threads} -b > {output}"
"minimap2 -ax {params.mapper} --split-prefix=tmp -t {threads} {input.reference_filter} {input.fastq} | samtools view -@ {threads} -b > {output} && samtools index {output}"


# Get a list of reads that don't map to genome you want to filter
@@ -82,7 +83,7 @@
rule get_reads_list_ref:
input:
fastq = config["long_reads"],
list = "data/unmapped_to_ref.list"
unmapped_list = "data/unmapped_to_ref.list"
group: 'assembly'
output:
temp("data/long_reads.fastq.gz")
@@ -93,7 +94,7 @@ rule get_reads_list_ref:
benchmark:
"benchmarks/get_reads_list_ref.benchmark.txt"
shell:
"seqtk subseq {input.fastq} {input.list} | pigz -p {threads} > {output}"
"seqtk subseq {input.fastq} {input.unmapped_list} | pigz -p {threads} > {output}"

# if no reference filter output this done file just to keep the DAG happy
rule no_ref_filter:
@@ -163,9 +164,9 @@ rule filter_illumina_ref:
reference_filter = config["reference_filter"]
group: 'assembly'
output:
bam = temp("data/short_unmapped_ref.bam"),
fastq = temp("data/short_reads.fastq.gz"),
filtered = temp("data/short_filter.done")
bam = "data/short_unmapped_ref.bam",
fastq = "data/short_reads.fastq.gz",
filtered = "data/short_filter.done"
params:
coassemble = config["coassemble"]
conda:
@@ -641,7 +642,7 @@ rule combine_assemblies:
flye_fasta = "data/flye_high_cov.fasta"
group: 'assembly'
output:
fasta = "data/final_contigs.fasta",
output_fasta = "data/final_contigs.fasta",
priority: 1
threads:
config["max_threads"]
@@ -653,10 +654,10 @@
rule combine_long_only:
input:
long_reads = "data/long_reads.fastq.gz",
fasta = "data/assembly.pol.rac.fasta"
input_fasta = "data/assembly.pol.rac.fasta"
group: 'assembly'
output:
fasta = "data/final_contigs.fasta",
output_fasta = "data/final_contigs.fasta",
# long_bam = "data/final_long.sort.bam"
priority: 1
conda:
@@ -727,6 +728,10 @@ rule complete_assembly:
'ln -s ../data/final_contigs.fasta ./; '
'cd ../;'
'rm -rf data/polishing; '
'rm -rf data/short_reads.fastq.gz; '
'rm -rf data/short_unmapped_ref.bam; '
'rm -rf data/short_unmapped_ref.bam.bai; '
'rm -rf data/short_filter.done; '

rule complete_assembly_with_qc:
input:
@@ -744,6 +749,10 @@
'ln -s ../data/final_contigs.fasta ./; '
'cd ../;'
'rm -rf data/polishing; '
'rm -rf data/short_reads.fastq.gz; '
'rm -rf data/short_unmapped_ref.bam; '
'rm -rf data/short_unmapped_ref.bam.bai; '
'rm -rf data/short_filter.done; '

rule reset_to_spades_assembly:
output:
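Several commits in this PR ("convert to run with no shell=True", "update complex piped subprocess commands") replace shell pipelines like the `minimap2 ... | samtools view ... > out` line in `map_reads_ref`. A sketch of the standard pattern for running a two-stage pipe without `shell=True` (the commands here are stand-ins, chosen so the sketch is testable, not the PR's actual calls):

```python
import subprocess

def run_pipeline(cmd1, cmd2, outfile):
    """Run the equivalent of `cmd1 | cmd2 > outfile` without shell=True.

    cmd1/cmd2 are argument lists, stand-ins for e.g. the minimap2 and
    samtools halves of the map_reads_ref rule's shell command.
    """
    with open(outfile, 'wb') as out:
        p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
        p2 = subprocess.Popen(cmd2, stdin=p1.stdout, stdout=out)
        p1.stdout.close()  # let cmd1 receive SIGPIPE if cmd2 exits early
        rc2 = p2.wait()
        rc1 = p1.wait()
    return rc1, rc2
```

Closing `p1.stdout` in the parent is the detail that makes the pipe behave like a shell pipe: only `cmd2` then holds the read end, so `cmd1` is terminated by SIGPIPE if the consumer exits first.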
153 changes: 85 additions & 68 deletions aviary/modules/assembly/scripts/assemble_pools.py
@@ -1,79 +1,96 @@
import os
import subprocess

def assemble_pools(
input_list: str,
input_fasta: str,
output_fasta: str,
metabat_done: str,
threads: int):
"""
Assemble metagenome bins using unicycler
"""

try:
os.makedirs("data/final_assemblies")
except FileExistsError:
pass

try:
os.makedirs("data/final_assemblies")
except FileExistsError:
pass


out_assemblies = []
with open(snakemake.input.list) as f:
for line in f:
if len(line.split()) == 6:
mb_bin, long_reads, length, bases_nano, short_reads, bases_ill = line.split()
else:
mb_bin, long_reads, length, bases_nano = line.split()
short_reads, bases_ill = 'none', 0
if os.stat(long_reads).st_size == 0:
no_long = True
else:
no_long = False
long_reads = long_reads[:-5] + '.fastq.gz'
long_reads = os.path.abspath(long_reads)
short_reads_1 = short_reads[:-5] + '.1.fastq.gz'
short_reads_2 = short_reads[:-5] + '.2.fastq.gz'
short_reads_1 = os.path.abspath(short_reads_1)
short_reads_2 = os.path.abspath(short_reads_2)
length, bases_nano, bases_ill = float(length), float(bases_nano), float(bases_ill)
out_assemblies.append('data/final_assemblies/%s_unicyc/assembly.fasta' % mb_bin)
if not os.path.exists('data/final_assemblies/%s_unicyc/assembly.fasta' % mb_bin):
if short_reads == 'none':
subprocess.Popen("unicycler --verbosity 0 -t %d -l %s -o data/final_assemblies/%s_unicyc" % (
snakemake.threads, long_reads, mb_bin), shell=True).wait()
elif no_long:
subprocess.Popen("unicycler --verbosity 0 -t %d -1 %s -2 %s -o data/final_assemblies/%s_unicyc" % (
snakemake.threads, short_reads_1, short_reads_2, mb_bin), shell=True).wait()
out_assemblies = []
with open(input_list) as f:
for line in f:
if len(line.split()) == 6:
mb_bin, long_reads, length, bases_nano, short_reads, bases_ill = line.split()
else:
mb_bin, long_reads, length, bases_nano = line.split()
short_reads, bases_ill = 'none', 0
if os.stat(long_reads).st_size == 0:
no_long = True
else:
subprocess.Popen("unicycler --verbosity 0 -t %d -1 %s -2 %s -l %s -o data/final_assemblies/%s_unicyc" % (
snakemake.threads, short_reads_1, short_reads_2, long_reads, mb_bin), shell=True).wait()
no_long = False
long_reads = long_reads[:-5] + '.fastq.gz'
long_reads = os.path.abspath(long_reads)
short_reads_1 = short_reads[:-5] + '.1.fastq.gz'
short_reads_2 = short_reads[:-5] + '.2.fastq.gz'
short_reads_1 = os.path.abspath(short_reads_1)
short_reads_2 = os.path.abspath(short_reads_2)
length, bases_nano, bases_ill = float(length), float(bases_nano), float(bases_ill)
out_assemblies.append('data/final_assemblies/%s_unicyc/assembly.fasta' % mb_bin)
if not os.path.exists('data/final_assemblies/%s_unicyc/assembly.fasta' % mb_bin):
if short_reads == 'none':
subprocess.run(f"unicycler --verbosity 0 -t {threads} -l {long_reads} -o data/final_assemblies/{mb_bin}_unicyc".split())
elif no_long:
subprocess.run(f"unicycler --verbosity 0 -t {threads} -1 {short_reads_1} -2 {short_reads_2} -o data/final_assemblies/{mb_bin}_unicyc".split())
else:
subprocess.run(f"unicycler --verbosity 0 -t {threads} -1 {short_reads_1} -2 {short_reads_2} -l {long_reads} -o data/final_assemblies/{mb_bin}_unicyc".split())

unbinned_set = set()
if os.path.exists(snakemake.input.metabat_done[:-4] + "binned_contigs.unbinned"):
with open(snakemake.input.metabat_done[:-4] + "binned_contigs.unbinned") as f:
for line in f:
unbinned_set.add(line.rstrip())
unbinned_set = set()
if os.path.exists(metabat_done[:-4] + "binned_contigs.unbinned"):
with open(metabat_done[:-4] + "binned_contigs.unbinned") as f:
for line in f:
unbinned_set.add(line.rstrip())


with open(snakemake.output.fasta, 'w') as o:
count = 0
getseq = False
with open(snakemake.input.fasta) as f:
for line in f:
if line.startswith('>') and line.split()[0][1:] in unbinned_set:
getseq = True
o.write('>unbinned_' + str(count) + '\n')
count += 1
elif line.startswith('>'):
getseq = False
elif getseq:
o.write(line)
for i in out_assemblies:
if not os.path.exists(i):
with open(i[:-14] + 'unicycler.log') as f:
lastline = f.readlines()[-1]
if lastline.startswith("Error: SPAdes failed to produce assemblies. See spades_assembly/assembly/spades.log for more info.") or \
lastline.startswith("Error: none of the SPAdes graphs were suitable for scaffolding in Unicycler") or \
lastline.startswith("Error: miniasm assembly failed"):
continue
with open(i) as assembly:
for line in assembly:
if line.startswith('>'):
with open(output_fasta, 'w') as o:
count = 0
getseq = False
with open(input_fasta) as f:
for line in f:
if line.startswith('>') and line.split()[0][1:] in unbinned_set:
getseq = True
o.write('>unbinned_' + str(count) + '\n')
count += 1
o.write('>unicycler_' + str(count) + '\n')
else:
elif line.startswith('>'):
getseq = False
elif getseq:
o.write(line)
for i in out_assemblies:
if not os.path.exists(i):
with open(i[:-14] + 'unicycler.log') as f:
lastline = f.readlines()[-1]
if lastline.startswith("Error: SPAdes failed to produce assemblies. See spades_assembly/assembly/spades.log for more info.") or \
lastline.startswith("Error: none of the SPAdes graphs were suitable for scaffolding in Unicycler") or \
lastline.startswith("Error: miniasm assembly failed"):
continue
with open(i) as assembly:
for line in assembly:
if line.startswith('>'):
count += 1
o.write('>unicycler_' + str(count) + '\n')
else:
o.write(line)

if not os.path.exists(output_fasta):
open(output_fasta, 'a').close()



if __name__ == '__main__':
input_list = snakemake.input.list
input_fasta = snakemake.input.fasta
output_fasta = snakemake.output.fasta
metabat_done = snakemake.input.metabat_done

threads = snakemake.threads

if not os.path.exists(snakemake.output.fasta):
open(snakemake.output.fasta, 'a').close()
assemble_pools(input_list, input_fasta, output_fasta, metabat_done, threads)
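The converted `subprocess.run(f"...".split())` calls above tokenise the command by whitespace, which silently splits any path containing a space into two arguments. A sketch of the failure mode and two safer alternatives (the path is hypothetical; `unicycler` is just the command name from the script):

```python
import shlex

long_reads = 'data/my reads.fastq.gz'  # hypothetical path containing a space

# f-string + .split(), as in assemble_pools.py: breaks the path in two.
naive = f"unicycler --verbosity 0 -l {long_reads}".split()

# Building the argument list explicitly keeps the path as one argument.
safe = ['unicycler', '--verbosity', '0', '-l', long_reads]

# shlex handles quoting when the command must be assembled as a string.
quoted = shlex.split(f"unicycler --verbosity 0 -l {shlex.quote(long_reads)}")
```

For the paths this workflow generates (under `data/final_assemblies/`) the `.split()` form happens to be safe, but the explicit-list form removes the assumption.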