Failure on GATK4_BEDTOINTERVALLIST due to incorrect exome.bed generation from iGenomes References. #112

Open
Shaun-Regenbaum opened this issue Dec 26, 2023 · 5 comments · May be fixed by #114
Labels
bug Something isn't working

Comments

@Shaun-Regenbaum

Shaun-Regenbaum commented Dec 26, 2023

Description of the bug

The GATK4_BEDTOINTERVALLIST step sometimes fails with certain reference genomes because the genome.dict or exome.bed file is created incorrectly from the reference GTF files. The two files then disagree on which sequences exist, and this sequence dictionary mismatch causes the step to fail.

Command used and terminal output

No response

Relevant files

No response

System information

No response

@Shaun-Regenbaum Shaun-Regenbaum added the bug Something isn't working label Dec 26, 2023
@Shaun-Regenbaum
Author

I am going to do some more exploration of this, and hopefully submit a PR with a fix this week.

@Shaun-Regenbaum Shaun-Regenbaum changed the title from "Incorrect GATK4_BEDTOINTERVALLIST Generation from iGenomes References." to "Failure on GATK4_BEDTOINTERVALLIST due to incorrect exome.bed generation from iGenomes References." Dec 26, 2023
@Shaun-Regenbaum
Author

I have a working fork in which I think I have fixed the issue. In short, the issue arises when the exome.bed file contains non-standard or unplaced chromosomal sequences, which happens quite often in non-human genomes. For example:

chrUn_GJ060129v1 3730 4217
chrUn_GJ060129v1 5192 5333
chrUn_GJ060129v1 5806 6353
chrUn_GJ060163v1 0 311
chrUn_GJ060163v1 741 1129
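To check whether a given reference is affected, one option is to compare the contig names in exome.bed against the `@SQ` entries in genome.dict. The following is only a rough sketch, not code from the fork: the function names are made up for illustration, and it assumes the dictionary is a standard SAM-style header with tab-separated `SN:` fields and that the BED file is whitespace-delimited.

```python
def dict_contigs(dict_text):
    """Collect contig names from the @SQ lines of a sequence dictionary."""
    names = set()
    for line in dict_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def bed_contigs(bed_text):
    """Collect contig names (first column) from BED records."""
    names = set()
    for line in bed_text.splitlines():
        # Skip blank lines and the optional BED header lines.
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            continue
        names.add(line.split()[0])
    return names


def missing_contigs(bed_text, dict_text):
    """Return BED contigs that the sequence dictionary does not define."""
    return sorted(bed_contigs(bed_text) - dict_contigs(dict_text))


if __name__ == "__main__":
    # Toy inputs: the dictionary here is invented, the BED rows are from above.
    toy_dict = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:1000\n"
    toy_bed = "chr1 10 20\nchrUn_GJ060129v1 3730 4217\nchrUn_GJ060163v1 0 311\n"
    print(missing_contigs(toy_bed, toy_dict))
```

If the returned list is non-empty, the BED file names sequences that the dictionary does not know about, which is the mismatch described in this issue.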

@Shaun-Regenbaum
Author

My fix was to add a workflow step that filters the exome.bed file down to the chromosomes defined in the genome.dict file. It shouldn't affect other pipelines and should simply allow the pipeline to handle a greater variety of reference genomes and species.
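For readers who want to try the same idea outside the pipeline, here is a minimal sketch of that kind of filter. It is an illustration under assumptions, not the actual step from the fork: the function name `filter_bed_by_dict` and the output file name in the usage comment are hypothetical, and the dictionary is assumed to use tab-separated `@SQ` lines with an `SN:` field.

```python
def filter_bed_by_dict(bed_lines, dict_lines):
    """Keep only BED records whose contig is defined in the sequence dictionary.

    Both arguments are iterables of lines (for example open file handles).
    Blank lines and BED header lines are dropped as well, because their
    first token is not a contig name from the dictionary.
    """
    contigs = set()
    for line in dict_lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    contigs.add(field[3:])

    kept = []
    for line in bed_lines:
        fields = line.split()
        if fields and fields[0] in contigs:
            kept.append(line)
    return kept


# Possible usage (the output file name is a placeholder):
# with open("genome.dict") as d, open("exome.bed") as b:
#     kept = filter_bed_by_dict(b, d)
# with open("exome.filtered.bed", "w") as out:
#     out.writelines(kept)
```

The filter only removes records, so intervals on contigs that are present in genome.dict pass through unchanged, including their original order.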

@maxulysse
Member

I love this idea, that's an amazing addition

@SAADAT-Abu

I have a working fork that I think I fixed the issue on. In short this issue would arise when the exome.bed file contained non standard or unplaced chromosomal sequences which can happen quite often in non human genomes, for example:

chrUn_GJ060129v1 3730 4217
chrUn_GJ060129v1 5192 5333
chrUn_GJ060129v1 5806 6353
chrUn_GJ060163v1 0 311
chrUn_GJ060163v1 741 1129

I am using your version with all default parameters and got this error:

Command error:
  Using GATK jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx30g -jar /usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar BedToIntervalList --INPUT exome.bed --OUTPUT genome.bed.interval_list --SEQUENCE_DICTIONARY genome.dict --TMP_DIR .
  14:48:55.685 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/usr/local/share/gatk4-4.2.6.1-0/gatk-package-4.2.6.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
  [Wed Jul 17 14:48:55 GMT 2024] BedToIntervalList --INPUT exome.bed --SEQUENCE_DICTIONARY genome.dict --OUTPUT genome.bed.interval_list --TMP_DIR . --SORT true --UNIQUE false --DROP_MISSING_CONTIGS false --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

I would be glad if you could help.
