Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error running Amplicon with diversity option #83

Open
XU-Nuo opened this issue May 24, 2021 · 3 comments
Open

Error running Amplicon with diversity option #83

XU-Nuo opened this issue May 24, 2021 · 3 comments
Assignees

Comments

@XU-Nuo
Copy link

XU-Nuo commented May 24, 2021

Shorah Version:

1.99.2

Command to reproduce:

$ shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d

I have trimmed my read length to equal length prior to this step.

Expected Behavior:

The analysis outputs the entropy.pdf and entropy.csv besides everything else when running without the -d option.

Actual Behavior:

The entropy.pdf and entropy.csv are generated, but the software terminates with error:

[mpileup] 1 samples in 1 input files
/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py:255: UserWarning: mpileup column not fully parsed
  warnings.warn('mpileup column not fully parsed')
Traceback (most recent call last):
  File "/opt/conda/envs/python3/bin/shorah", line 14, in <module>
    main()
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 210, in main
    args.func(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/cli.py", line 89, in amplicon_run
    amplicon.main(args)
  File "/opt/conda/envs/python3/lib/python3.9/site-packages/shorah/amplicon.py", line 387, in main
    h = list(open('coverage.txt'))[0]
IndexError: list index out of range

Output from shorah.log

INFO 2021/05/24 13:36:25 cli.py: main() 204: 	shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
INFO 2021/05/24 13:36:25 cli.py: main() 205: 	shorah version:1.99.2

INFO 2021/05/24 13:36:25 amplicon.py: main() 352: 	shorah amplicon  -b my_sorted_bam.bam -f amplicon.fasta -d
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: 	samtools view my_sorted_bam.bam | cut -f 10 > rl.txt
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: 	Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 229: 	n_reads: 500
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 235: 	trimmed_mean: 388
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 90: 	samtools mpileup  -f amplicon.fasta -d 10000 my_sorted_bam.bam > sample.mpu
DEBUG 2021/05/24 13:36:25 amplicon.py: run_child() 99: 	Child samtools returned 0
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 272: 	start: 1, stop: 388
INFO 2021/05/24 13:36:25 amplicon.py: highest_entropy() 286: 	highest entropy found at position 22
.
.
.
INFO 2021/05/24 13:36:25 amplicon.py: main() 372: 	analysing region from -1 to 387

I have traced back in the code and it turns out that the region index return by function highest_entropy() in src/shorah/amplicon.pydoes not match my expectation. When all my reads have the same length, it will skip the steps finding the max entropy (

for i in range(rsta, rsto):
to line 304) and returns -1 for the start index of region. While from my understanding towards the code, the finding entropy part should at least be iterated one time. I found that's because the value of rsto is lower that rsta in this case. So I tried to fix this issue by changing line 298 from

rsto = min(stop - trimmed_mean, highest_ent_pos + 1)

to

rsto = min(stop - trimmed_mean + 2, highest_ent_pos + 1)

Also, in line 305, the high_ent_stop should ends where the region ends as the return value to match the the logic invoked the function, so I subtract 1 from high_ent_start + trimmed_mean, from

high_ent_stop = high_ent_start + trimmed_mean

to

high_ent_stop = high_ent_start + trimmed_mean -1

By applying these two changes, I am able to run the amplicon function on my data with equal length with the -d option, as the region is now fixed for this rare margin case. I don't know if I understand this part right and make the correct modification. Would you verify it for me to see if this is a bug?

@niemasd
Copy link

niemasd commented Aug 10, 2021

I'm running into the same error, also on v1.99.2, but even in shorah amplicon mode with default arguments:

$ shorah amplicon -b SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f NC_045512.2.fas
Traceback (most recent call last):
  File "/usr/local/bin/shorah", line 14, in <module>
    main()
  File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 210, in main
    args.func(args)
  File "/usr/local/lib/python3.6/site-packages/shorah/cli.py", line 89, in amplicon_run
    amplicon.main(args)
  File "/usr/local/lib/python3.6/site-packages/shorah/amplicon.py", line 387, in main
    h = list(open('coverage.txt'))[0]
IndexError: list index out of range

So I'm not sure that this error is unique to the -d option, but rather, it seems general to shorah amplicon. Here's the contents of my log file:

INFO 2021/08/10 14:19:35 cli.py: main() 204:    /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 cli.py: main() 205:    shorah version:1.99.2

INFO 2021/08/10 14:19:35 amplicon.py: main() 352:       /usr/local/bin/shorah amplicon -b ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam -f ../NC_045512.2.fas
INFO 2021/08/10 14:19:35 amplicon.py: main() 372:       analysing region from 1 to 29903
DEBUG 2021/08/10 14:19:35 amplicon.py: run_child() 90:  /usr/local/bin/b2w -i 0 -w 29903 -m 28407 -x 10000 -c 0 ../SEARCH-40821__D102066b__I02__210723_A00953_0359_BHGY7MDSX2__003.trimmed.sorted.bam ../NC_045512.2.fas
DEBUG 2021/08/10 14:19:40 amplicon.py: run_child() 99:  Child /usr/local/bin/b2w returned 0
DEBUG 2021/08/10 14:19:40 amplicon.py: main() 380:      b2w returned 0

In addition to the log file, shorah amplicon created the following files, all of which are 0 bytes:

  • coverage.txt
  • reads.fas
  • w-NC_045512.2-1-29903.reads.fas

The edits proposed by XU-Nuo seem to fix the issue for -d mode on my end as well, but not in the default parameters

@DrYak
Copy link
Member

DrYak commented Aug 11, 2021

Thank you for reporting. I hope I'll have time to check this next week...

@DrYak DrYak self-assigned this Aug 11, 2021
@AdmiralenOla
Copy link

Same issue here. I wonder if the problem might be the arguments sent to b2w.

Lines 377-378 in amplicon.py:

    b2w_args = ' -i 0 -w %d -m %d -x %d -c %i%s %s %s %s' % (ref_length, int(
        min_overlap * ref_length), max_coverage, cov_thrd, d, in_bam, in_fasta, region)

I'm not sure I understand what b2w does exactly, but this parameter combination of window length equal to reference length and minimum overlap equal to 95% of that always creates empty output files (coverage.txt and the *reads.fas files) for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants