Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Q. Related to RealignReads.py - contig start and end value #110

Closed
quito418 opened this issue May 25, 2022 · 5 comments
Closed

Q. Related to RealignReads.py - contig start and end value #110

quito418 opened this issue May 25, 2022 · 5 comments

Comments

@quito418
Copy link

quito418 commented May 25, 2022

Hi,

I think reference region (reference_start ~ reference_end) should be larger than the reads region (extend_start ~ extend_end).

Given ctg_start and ctg_end, extend_start and extend_end is defined as ctg_start - max_window_size, ctg_end + max_window_size

If I am correct shouldn't reference_start and reference_end should be each extend_start - expandReferenceRegion, extend_end + expandReferenceRegion ?

(I encounter an error, where reference_sequence is shorter than reference_position inside samtools_view_generator_from function due to this)

Thank you!

reference_start, reference_end = ctg_start - param.expandReferenceRegion, ctg_end + param.expandReferenceRegion

@zhengzhenxian
Copy link
Collaborator

Hi,

expandReferenceRegion and max_window_size have different purposes. expandReferenceRegion is for minimizing the influence of coverage decay at reference boundaries, while max_window_size is for maximizing the performance of short read realignment.

Where did the samtools_view_generator_from error occur, was it from calling or training?

@quito418
Copy link
Author

quito418 commented May 26, 2022

I encountered error in this line

reference_base = reference_sequence[reference_position - reference_start_0_based] # 0 base

The reference_position - reference_start_0_based got larger than the reference_sequence array.

I was running the below command to realign Illumina reads (150bp)

${PARALLEL} --joblog ${DATASET_FOLDER_PATH}/realignreads.log -j${THREADS_LOW} \
"${PYPY} ${CLAIR3} RealignReads \
    --bam_fn {4} \
    --ref_fn {5} \
    --read_fn ${PHASE_BAM_PATH}/_{2}_{3}_{1} \
    --ctgName ${CHR_PREFIX}{1} \
    --samtools ${SAMTOOLS} \
    --chunk_id {6} \
    --chunk_num ${chunk_num}" ::: ${CHR[@]} ::: ${ALL_SAMPLE[@]} :::+ ${DEPTHS[@]} :::+ ${ALL_UNPHASED_BAM_FILE_PATH[@]} :::+ ${ALL_REFERENCE_FILE_PATH[@]} ::: ${CHUNK_LIST[@]}

so I suspected reference region was not large enough to cover up read region

@zhengzhenxian
Copy link
Collaborator

Thanks for the details. Could you please try to set the default expandReferenceRegion to 10000 in param_f.py to check if the error persists? It would slightly increase the memory usage while performing the same.

@quito418
Copy link
Author

Sounds great, I will try it out and share the result here!

Thanks for the prompt reply

@quito418
Copy link
Author

I confirmed that it works fine after changing the expandReferenceRegion number!

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants