Releases: broadinstitute/gatk
4.0.9.0
Highlighting this release are some important fixes and improvements to the HaplotypeCaller, in particular support for genotyping spanning deletions and a fix to the reference confidence calculation around indels. This release also brings support for "Requester Pays" GCS (Google Cloud Storage) buckets, fasta.gz support to the -R/--reference argument, a port of LeftAlignAndTrimVariants from GATK3, a new tool FuncotatorDataSourceDownloader to download Funcotator datasources, and bug fixes to Mutect2, VariantRecalibrator, and SelectVariants.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
- HaplotypeCaller
  - Fixed the reference confidence calculation upstream of indels (#5172)
    - Improve hom-ref GQs near indels in GVCFs. Also consider bases on either side of indels informative if local assembly has been performed.
    - The previous behavior generated some PL=0,0,0 no-calls because the CIGAR of reads containing indels wasn't taken into account when determining which reads were informative for the indel reference confidence model. The local realignment wasn't being used inside the active region previously either, which has been fixed. A related change considers bases on either side of indels informative if local assembly has been performed (but not during active region detection). Both result in far fewer 0,0,0 calls. Unfortunately there are still some 0,0,X homRef calls related to #5171.
  - Make HaplotypeCaller genotype and output spanning deletions (#4963)
    - Modifies HaplotypeCaller so that it can output and genotype spanning deletion alleles represented by the * allele.
    - Fixes #2960
    - Previously, the output of HaplotypeCaller would not include spanning deletion alleles when run in single sample VCF mode or in genotype given alleles mode, even when that genotype would be more appropriate. In the joint calling workflow GenotypeGVCFs adds genotypes for spanning deletions, although the input likelihoods will not be broken out to specifically account for spanning deletion alleles.
  - Simplify HaplotypeBAMWriter code. #944 (#5122)
- Mutect2
  - Mutect2 now emits DP values in the FORMAT field (#5185)
  - Add --get-af-from-ad option to recalculate the allele fraction based on AD instead of the Bayesian estimate (#5118)
    - Recommended for mitochondrial applications
  - Fixed a StringIndexOutOfBoundsException crash in the ReferenceBases annotation when a variant is within 10 base pairs of the end of a chromosome (#5151)
  - Restore base quality filter code that got removed unintentionally in #4895. (#5123)
  - Remove extra space in the MutectVersion header line (previously was Mutect Version) (#5184)
- Added support for "Requester Pays" GCS (Google Cloud Storage) buckets via new --gcs-project-for-requester-pays argument (#5140)
- Added fasta.gz support to the -R/--reference argument in walker tools (#5120)
- Added GCS/NIO support to the --tmp-dir argument (#4469)
- Upgraded google-cloud-java to the official 0.62.0 release, and moved off of our custom fork of the library. This release includes the retry for transient 502 errors that we added to our fork in GATK 4.0.8.0 (#5194) (#5135)
- Ported the LeftAlignAndTrimVariants tool from GATK3 (#5144)
- VariantRecalibrator: the serialized model now sets annotation order (#3655)
  - This addresses a problem where serialized GMMs for VQSR assumed that the annotation order would be the same between the commands that generated them and the commands that used them. VQSR no longer depends on the commandline order of the annotations.
- SelectVariants: Drop sites with the * allele as the only ALT when running with --exclude-non-variants (#5129)
- Funcotator:
  - Created a new FuncotatorDataSourceDownloader tool to download data sources. (#5150)
  - Add an experimental FilterFuncotations tool (#4991)
  - Updated COSMIC to annotate protein change strings with their counts. (#5181)
  - Fix INDEL start/stop position and alleles for VCF gencode output. (#5131)
  - Get datasource version from a manifest file instead of the README (#5149)
  - Extract a new FuncotatorEngine to make it easier to write additional tools in the future that leverage Funcotator's annotation engine (#5134)
  - Handle character encoding error cases. (#5124)
- CNNScoreVariants:
- CNV tools:
- SV tools:
  - Bug fix to read name mangling in ExtractOriginalAlignmentRecordsByNameSpark (#5107)
  - Added an InsertSizeDistribution class to represent expected insert-size distribution (normal and log-normal distributed) parameterized by insert size mean and stddev (#4827)
  - Added documentation clarification and additional validation to SVInterval (#5157)
  - Test and utils clean up (#5116)
- MarkDuplicatesSpark:
- Clone read base qualities rather than reference them directly in the read clipper code to prevent unsafe array operations (#4926)
- Fix three bugs in the AlignmentUtils class (#3494)
  - The treatment of D-over-D in function applyCigarToCigar() was backward.
  - In function createReadAlignedToRef() the read start position passed to the leftAlignIndel() call was incorrect if the haplotype has an indel relative to reference.
  - When the leftAlignIndel() call drops any leading D operator in the result cigar, the read start position needs to be adjusted accordingly.
- Test infrastructure improvements:
- Documented use of --temp-dir with GenomicsDBImport. (#5047)
- Deleted obsolete experimental tool MarkDuplicatesGATK in favor of MarkDuplicatesSpark (#5166)
- Deleted obsolete experimental tool BaseRecalibratorSparkSharded (#5192)
- Upgraded htsjdk to version 2.16.1 (#5168)
- Upgraded Picard to version 2.18.13. (#5173)
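
The Mutect2 --get-af-from-ad option in this release recalculates the allele fraction from AD rather than using the Bayesian estimate. As a rough illustration of that idea only (not GATK's implementation), the sketch below assumes AD is the usual VCF per-allele depth list with the reference depth first, and takes each ALT fraction as its depth divided by the total depth. The function name and the zero-depth convention are hypothetical.

```python
from typing import List


def af_from_ad(ad: List[int]) -> List[float]:
    """Return one allele fraction per ALT allele from a VCF-style AD list.

    ad[0] is the reference depth; ad[1:] are the ALT depths.
    Assumption (not from GATK): zero total depth yields 0.0 for every ALT.
    """
    if len(ad) < 2:
        raise ValueError("AD must hold a reference depth and at least one ALT depth")
    total = sum(ad)
    if total == 0:
        return [0.0] * (len(ad) - 1)
    return [alt / total for alt in ad[1:]]
```

For example, `af_from_ad([50, 25, 25])` gives one fraction for each of the two ALT alleles.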
4.0.8.1
This is a small bug fix release to fix an issue with unpaired reads in Mutect2, as well as small fixes and improvements to Funcotator, FilterVariantTranches, and MarkDuplicatesSpark.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Mutect2: Fixed a "Cannot get mate information for an unpaired read" error that could occur with certain datasets containing unpaired reads that pass all the M2 read filters and show evidence of a SNV (#5121)
- Funcotator:
  - Fixes to the splice site logic. (#5106)
    - Funcotator now ignores leading indel bases when checking if variants are within the splice site boundaries (e.g. if a leading base in an indel, which is preserved between the reference and alternate alleles, is within the splice site boundary but the bases that have been changed are NOT, then the variant is now correctly labeled as NOT a splice site).
  - Populate the DB SNP validation status field properly (#5046)
    - Funcotator will now populate the MAF DB SNP Validation status field with proper values (e.g. "by1000genomes") instead of a boolean value (e.g. "TRUE")
  - Funcotator now handles multiple records in a VCF funcotation factory that have the same pos, ref, and alt combination, even if equivalent and not exact matches.
- FilterVariantTranches:
- Updated MarkDuplicatesSpark scoring and comparison code to reflect changes in Picard (#5023)
  - Updated the scoring code to no longer take into account the unclipped start position of mismatching reads. Also changed the score to be a double packed short value in order to better reflect Picard scoring code.
- Other Changes:
4.0.8.0
This release features some significant changes to Mutect2 that improve both performance and correctness, as well as a bug fix to GenomicsDBImport for large interval lists.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Mutect2
  - Handle overlapping mates in M2 active region detection, causing fewer false active regions (#5078)
    - Makes Mutect2 ~25% faster in many cases with no loss of accuracy!
  - Filter M2 calls that are near other filtered calls on the same haplotype (#5092)
    - A very effective new filter that significantly reduces false positives
  - New Orientation Bias Filter (#4895)
    - New, improved orientation bias model, without which the M2 pipeline is not viable for NovaSeq data.
  - Changed the default AF slightly for M2 tumor-only mode (just a small tweak) (#5067)
  - Optimize some Mutect-related tools (#5073)
    - Everything that inherits from AbstractConcordanceWalker (this includes the Concordance tool and MergeMutect2CallsWithMC3) is now much faster on the cloud
  - Fixed edge case for M2 palindrome transformer (#5080)
    - Fixed an edge case involving reads assigned huge fragment lengths
  - Allowing counts for supporting alt reads in the validation normal. (#5062)
    - Added useful information suggesting possible normal artifacts in somatic validation tool.
  - M2 wdl doesn't emit unfiltered vcf, which is redundant (#5076)
- GenomicsDBImport
- Updated MarkDuplicatesSpark tie-breaking rules to reflect changes in picard (#5011)
- Added the ability for CompareDuplicatesSpark to output mismatching reads (#4894)
- Updated our google-cloud-java fork to 0.20.5-alpha-GCS-RETRY-FIX (#5099)
  - We now retry on 502 and UnknownHostException errors when using NIO
- SV Tools:
  - Various improvements (#4996)
    - output a single VCF for new interpretation tool
    - bring MAX_ALIGN_LENGTH and MAPPING_QUALITIES annotations from CPX variants to re-interpreted simple variants
    - add new CLI argument and filter assembly based variants based on annotation MAPPING_QUALITIES, MAX_ALIGN_LENGTH
    - filter out variants of size < 50
  - Bug fix for the extreme edge case where after alignments de-overlapping, an alignment block is only 1 base long (#4962)
  - Turn back on checking variant info fields against header in SV vcf writing (turned off temporarily a long time ago but slipped attention after implementation stabilized) (#5084)
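
The google-cloud-java fork update in this release adds retries on 502 and UnknownHostException errors when using NIO. The general pattern can be sketched as below. This is an illustration in Python, not the library's code: `TransientError`, `with_retries`, and the backoff parameters are all hypothetical names chosen for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as an HTTP 502."""


def with_retries(op: Callable[[], T], max_attempts: int = 3,
                 base_delay: float = 0.0) -> T:
    """Call op(), retrying on TransientError with exponential backoff.

    base_delay defaults to 0.0 so the sketch runs instantly; a real
    caller would pass a positive number of seconds.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only errors classified as transient are retried; any other exception propagates immediately, which keeps genuine failures from being masked.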
4.0.7.0
Some important fixes in this release include a new version of GenomicsDB with a fix for the stack overflow seen when using large interval lists, and an updated Docker image with a fix for the missing R/ggplot2 dependencies.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/.
Docker
- Restore missing R/ggplot2 dependencies on the Docker image. #5040
GenomicsDB
- Fix GenomicsDBImport stack overflow when using large number of intervals #4997
Mutect2
- Don't use very short stubs of clipped reads for genotyping #5057
- Add maxRetries to runtime in M2 WDLs #5049
- Fix an edge case bug in PalindromeArtifactReadTransformer #5038
- Make orientation bias filtering default to true #5019
- Added option for ValidateBasicSomaticShortMutations to output a vcf #4999
- Add Mutect2 PalindromeArtifactReadTransformer to hard clip inverted tandem repeats insertion artifacts #4998
- Making MAF become the output of Funcotator in M2 WDL and multiple transcript fix. #4941
CNV Tools
- Exposed ability to blacklist intervals in CNV WDLs. #5027
- Added output of IGV-compatible .seg files to ModelSegments. #5048
Structural Variants
- Add BreakpointEvidence filter based on classifier #4769
- Address more edge cases in assembly alignments #5044
- Refactor AssemblyContigAlignmentsConfigPicker #4971
- Fix an edge case in assembly contig alignment picker where no good mappings to canonical mappings exist #5005
- Trim down ref bases for CPX variants #4970
Funcotator
- VCF Funcotation Factory will recognize equivalent alleles (even when not exact) #4977
Other
4.0.6.0
Highlights of this release include:
- A new version of GenomicsDB that brings many long-requested features such as support for multiple intervals in GenomicsDBImport
- A significantly (~33%) smaller GATK docker image
- An important bug fix for the -new-qual option in GenotypeGVCFs/HaplotypeCaller/Mutect2
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- GenomicsDB: new version with many long-awaited features and bug fixes (#4645)
  - Multi-interval support in GenomicsDBImport (#3269)
    - Now you can specify multiple -L intervals when importing variants into GenomicsDB using GenomicsDBImport, instead of having to specify one interval per invocation.
  - New protobuf-based API to allow configuration without editing JSON files
  - Support for sites-only queries
  - Support for returning the genotype (GT) field in queries
  - Fixed bug where records with spanning deletion alleles could cause reads from GenomicsDB to fail (#4716)
- Reduced the size of the GATK docker image by approximately 33%, from ~5.3 GB to ~3.5 GB (#4955)
- Fixed a regression in the -new-qual option for GenotypeGVCFs/HaplotypeCaller/Mutect2 that was introduced in GATK 4.0.5.0 (#4980)
  - There was a precision issue in the AlleleFrequencyCalculator when running with -new-qual that could cause a crash at certain sites (specifically, sites with spanning deletions and highly unlikely alt alleles).
- HaplotypeCaller: don't count qual = 0 sites as polymorphic for GVCF mode (#4967)
- ValidateBasicSomaticShortMutations: added a new optional argument to produce summary table output (#4982)
- ExtractOriginalAlignmentRecordsByNameSpark: added a new optional argument to invert the logic in the read-name filtering (#4944)
- Separated out the "variant calling" integration tests from the rest of the integration tests to speed up overall test suite runtime in travis (#4984)
4.0.5.2
Highlights of this release include major Funcotator performance improvements on hg19/b37 inputs, a newly rewritten Java version of FilterVariantTranches, HaplotypeCaller bamout improvements, and improved Python integration by eliminating timeouts.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/.
Funcotator Improvements
- Improve handling of hg19/B37 references (#4586).
  - Fixed performance bug involving excessive cache misses when querying datasources, resulting in major performance improvements when running on HG19/B37 data (performance increased by approx. 30x with v1.4.20180615 of the standard Funcotator data sources) (#4586).
  - Automatically detect when B37 data is run against an hg19 data source and convert contig names to be hg19 compliant.
  - Assumes all data sources for the hg19 reference are compliant with hg19 contig names. User-created data sources will have to honor this.
  - Perform additional validation on input data to ensure a given reference FASTA has a sequence dictionary that is a superset of the given VCF. This is a more stringent check than is automatically performed by the GATK. Can be disabled with the --disable-sequence-dictionary-validation flag.
  - Released new version of datasources to go with this release (1.4.20180615), necessary because the data sources needed to be made consistent with hg19 (before they were a mix of hg19 and b37 contig names).
  - Updated the minimum required data source version to be the latest release.
  - Updated the getDbSNP.sh and createSqliteCosmicDb.sh data source scripts to preprocess those data sources to have hg19-compliant contig names.
  - Removed the --allow-hg19-gencode-b37-contig-matching flag.
  - Removed the --allow-hg19-gencode-b37-contig-matching-override flag.
- User defined transcripts were being used as a filter rather than a priority order. The filtering step has been eliminated. Fixes #4918 (#4931)
- Added custom MAF fields to MafOutputRenderer (#4917)
- LocatableXsv data sources now produce at most 1 funcotation per allele pair. (#4936)
- LocatableXsv data sources now provide the correct number of funcotations (#4915)
- Preserve VCF fields in MAF output (#4872)
- Fixing error when spanning deletions overlap coding regions (#4881)
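
The contig-name conversion mentioned under "Improve handling of hg19/B37 references" can be pictured with a small sketch. It assumes the common naming convention in which b37 uses "1".."22", "X", "Y", and "MT" while hg19 uses "chr1".."chr22", "chrX", "chrY", and "chrM"; Funcotator's actual mapping may differ, and the function name here is hypothetical.

```python
# Primary b37 contigs that differ from hg19 only by the "chr" prefix.
_B37_PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y"}


def b37_to_hg19_contig(name: str) -> str:
    """Map a b37-style contig name to its hg19-style equivalent.

    Assumption: names outside the primary set (for example unplaced
    scaffolds, or names that are already hg19-style) pass through unchanged.
    """
    if name == "MT":
        return "chrM"  # the mitochondrial contig is renamed, not just prefixed
    if name in _B37_PRIMARY:
        return "chr" + name
    return name
```

Passing unknown names through unchanged keeps the function safe to apply to inputs that already use hg19 names.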
HaplotypeCaller/Mutect2
- Improvements to FilterMutectCalls. Eliminates about 3% of all false positives in DREAM while reducing sensitivity by about 0.1%
- Fix many questionable -bamout alignments where, because of a bad choice of Smith-Waterman parameters, deletions were preferred over single-base substitutions. (#4858)
  - Result is many fewer spurious indels in the -bamout output.
  - Introduced new SmithWaterman parameters affecting realignment of the reads to their best haplotype. This also changes some annotations that depend on the alignment, such as BaseQualityRankSum and ReadPositionRankSum. The changes are slight and make things more correct.
- Modify the behavior of (BaseGraph) getNextReferenceVertex for non-ref paths (#4889)
FilterVariantTranches
- Rewrite VCF Tranche filtering in java, with tests (#4800)
Engine
- StreamingPythonExecutor no longer uses timeouts or relies on prompt synchronization. (#4757)
- Allow concordance tools (AbstractConcordanceWalker) to use NIO for truth call set (#4905)
- Add pre- and post- apply variant transformer to VariantWalkerBase
MarkDuplicatesSpark
- Fixed a missing special case in MarkDuplicates ReadsKey code to better match current picard results (#4899)
- Reworked the keys for MarkDuplicatesSpark to be sufficient for grouping on their own. (#4878)
- Improve error message for MarkDuplicates duplicates readnames issues (#4879)
Structural Variants
- Add tests for AssemblyContigWithFineTunedAlignments (#4961)
- Fix no index output for assembly bam file (#4945)
- Overhaul tests on assembly-based non-complex breakpoint and type inference code (#4835)
- Simple fix to remove trailing slash in GCS_SAVE_PATH to avoid double slashes in GCS_RESULTS_DIR (#4873)
Misc:
4.0.5.1
This is primarily a bug fix release to fix a crash in the help system (#4875). The issue was that tools that use annotations (which includes Mutect2, HaplotypeCaller, GenotypeGVCFs, CombineGVCFs, and VariantAnnotator) would crash when trying to print their help text. This could be triggered by running with an explicit --help, or by typing an invalid tool command line.
This release also brings in some improvements to Funcotator, including a new mode to output annotations for all transcripts.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Fix crash when displaying help text for tools that use annotations (#4876)
- Funcotator improvements (#4838) (#4870)
  - Added ALL mode for transcript selection (--transcript-selection-mode ALL) which will output full annotation fields for all transcripts
  - IGR annotations are no longer reported if there are any transcripts that would result in a non-IGR annotation for a given variant
  - VCF Datasources now have to match both the alt and ref alleles to be added as annotations to a variant
  - Added the --allow-hg19-gencode-b37-contig-matching-override flag to allow for even more permissive matching of contig names between B37 and HG19 references (primarily designed to be used in development)
  - Updated the experimental Funcotator WDL to work properly in cromwell
  - Refactored internals of Funcotator to use FuncotationMap objects to store annotations
  - Additional tests to ensure VCF and MAF protein change strings are equivalent
  - Other minor internal bugfixes for testing
- Fix to the Oncotator command line in the Mutect2 WDL (#4862)
- Removed unsupported Mutect2 WDLs (these now live on Firecloud) (#4836)
4.0.5.0
Highlights of this release include the ability to emit MNPs in Mutect2 and HaplotypeCaller via a new --max-mnp-distance argument, much better active region detection for low allele fractions in Mutect2, new priors for variant sites and homRef blocks in HaplotypeCaller, a new tool FilterAlignmentArtifacts to filter false positive alignment artifacts in the Mutect2 pipeline, performance improvements to CNNScoreVariants and Funcotator, and a new --sites-only-vcf-output GATK engine argument to suppress genotypes when writing VCFs.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- Mutect2
  - Made Mutect2 active region determination much better for low allele fractions (#4832)
    - In particular, this makes Mutect2 vastly better for mitochondrial and cfDNA calling
  - Mutect2 can now emit MNPs according to adjustable distance threshold specified via --max-mnp-distance (#4650)
  - Tweaked Mutect2 read position filter to handle non-biological (eg FFPE) insertions better (#4851)
  - Fixed Mutect2 bug where triallelic normal artifacts were sometimes hidden from filtering engine (#4809)
  - Mutect2 STR filter now also looks at insertions (#4845)
    - This lowers the indel false positive rate dramatically.
  - Mutect2 WDL:
- Added new tool FilterAlignmentArtifacts (#4698)
  - Filters false positive alignment artifacts (that is, apparent variants due to reads being mapped to the wrong genomic locus) from a VCF callset by checking variant-supporting reads and their mates.
  - By considering the realignment of the read and its mate, it saves a lot of variants, especially in low-complexity regions, from being filtered as mapping errors.
- HaplotypeCaller
  - HaplotypeCaller can now emit MNPs according to adjustable distance threshold specified via --max-mnp-distance (#4650)
  - New HaplotypeCaller priors for variant sites and homRef blocks (#4793)
    - Added new --population-callset argument allowing an external panel of variants to be specified to inform the frequency distribution underlying the genotype priors
    - Added new --num-reference-samples-if-no-call argument to control whether to infer (and with what effective strength) that only reference alleles were observed at sites not seen in any panel
    - As a side effect of this change, CalculateGenotypePosteriors now supports indels.
  - GCS/NIO output support for the -bamout argument (#4721)
- -new-qual in HaplotypeCaller/Mutect2/GenotypeGVCFs no longer counts spanning deletions as support for variant qual (#4801)
- CNNScoreVariants
- GATK Engine
  - Added a new traversal type TwoPassVariantWalker that does two passes over its input variants (#4744)
  - Enable the -L argument to read feature files (such as .bed or .vcf files) from non-local Paths, including GCS buckets (#4854)
  - Added --sites-only-vcf-output argument to the GATK engine to suppress genotype fields when writing VCFs (#4764)
  - Tools that use annotations now use the barclay annotation plugin (#4674)
  - Added new ReadQueryNameComparator (#4731)
  - Automatically schedule temporary resource files for delete on exit (#4616)
- Spark tools
- MarkDuplicatesSpark
  - Fixed MarkDuplicatesSpark so it handles supplementary reads with unmapped mates properly (#4785)
  - Added a distinction between PCR orientation and Optical Duplicates orientation in MarkDuplicatesSpark (#4752)
  - Fixed serialization crash in MarkDuplicatesSpark (#4778)
  - Fixed queryname partitioning bug where asking for queryname sort would result in reads with the same name being split between partitions (#4765)
  - Changed MarkDuplicatesSpark to sort non-queryname sorted bams before processing to ensure marking is consistent across shards (#4732)
  - Renamed some MarkDuplicatesSpark arguments to follow the "kabob-style" convention (#4715)
  - MarkDuplicatesSpark now uses the Picard OpticalDuplicatesFinder directly (#4750)
  - MarkDuplicatesSpark now uses Picard metrics code directly (#4779)
- BwaSpark: disable sequence dictionary validation when aligning reads #4131 (#4308)
- Funcotator
  - Major performance improvements due to added caching and other optimizations (#4740)
  - Various fixes (#4783) (#4817) (#4770)
    - Sanitize special characters when outputting VCF so that VCF validation passes
    - Ordering specified in the header did not match the variants and hg19/b37 - VCF datasources were being inconsistently processed, inducing a lot of missed annotations.
    - Added Funcotator tests for Clinvar and Gencode v28 in hg38, and mixed chr/no-chr GENCODE.
    - Eased restrictions so that Gencode v28 would be recognized as a valid gtf. Future versions of Gencode will not fail just based on the version number and warning will be emitted instead.
    - Refining handling of transcripts with missing sequence info.
    - Refactored UTR VariantClassification handling.
    - Added warning statement when a transcript in the UTR has no sequence info (now is the same behavior as in protein coding regions).
    - Added tests to prevent regression on data source date comparison bug.
    - Fixed DNA Repair Genes getter script.
    - Fixed an issue in COSMIC to make it robust to bad COSMIC data.
    - Gencode no longer crashes when given an indel that starts just before an exon.
    - Fixed the SimpleKeyXsvFuncotationFactory to allow any characters to work as delimiters (including characters used in regular expressions, such as pipes).
    - Modified several methods to allow for negative start positions in preparation for allowing indels that start outside exons.
    - Fixed an issue in 5' UTR processing that would cause variant alleles with length > 1 to throw an exception (fixes issue #4712).
    - Fixed a bug in the version detection for Funcotator data sources that would prevent newer data source versions from being detected as compatible (date comparison error).
  - Gencode data sources now have names preserved from config files. (#4823)
- GCNV kernel tunings (#4720)
  - Fixed a minor issue in sampling error estimation that could lead to NaN (as a result of division by zero)
  - Introduced separate internal and external admixing rates
  - Introduced two-stage inference for cohort denoising and calling
  - Capped phred-scaled qualities to maximum values permitted by machine precision in order to avoid NaNs and overflows.
  - Took a first step toward tracking and logging parameters during inference, starting with the ELBO history.
- Validation of sequence dictionaries from multiple BAMs now throws warning instead of exception in CNV workflows. (#4758)
- SV tools
  - Tweak BWA to allow "gappier" alignments in local assemblies (#4708)
  - Added a new experimental tool named CpxVariantReInterprepterSpark to extract barebone-annotated simple variants from a VCF containing complex variants that was produced by the GATK-SV discovery pipeline (#4602)
  - Fix "UnhandledCaseSeen" error in StructuralVariationDiscoveryPipelineSpark (#4677)
- Added new SingleSequenceReferenceAligner class to align against an on-the-fly single contig reference using Bwa-Mem (#4780)
- Updates to the conda environment for Python-based tools (#4749)
  - Fix #4741, where newer versions of conda appear to treat relative references in the environment yml as being relative to the yml file instead of relative to the cwd (based on observation).
  - Add a second conda yml file (gatkcondaenv.intel.yml) for environments that use Intel hardware acceleration and the Intel Tensorflow package (based on #4735).
  - Added a gradle task (condaEnvironmentDefinition) to generate the conda yml files from a single template to ensure that all the environment definitions remain in sync. This task also generates the Python package archive.
  - Added a gradle task (localDevCondaEnv) to create or update a local (non-Intel) conda environment. This is a shortcut for use during development when you're iteratively changing/testing Python code and want to update the conda env.
- Added a new WEX test bam to src/test/resources/large, with a companion target interval list (#4756)
- Add slightly modified version of GATK3 github issue template (#4796)
- Updated htsjdk to 2.15.1 (#4830)
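
The --max-mnp-distance argument added to Mutect2 and HaplotypeCaller in this release controls which nearby substitutions are emitted together as an MNP. The sketch below illustrates one way such a threshold can group positions. It assumes the substitutions are already known to sit on the same haplotype and that "distance" means the difference between adjacent positions; GATK's exact definition may differ, and the function name is hypothetical.

```python
from typing import List


def group_for_mnps(positions: List[int], max_mnp_distance: int) -> List[List[int]]:
    """Group substitution positions into candidate MNPs.

    A position joins the current group when it lies within
    max_mnp_distance of the previous position; otherwise it starts a
    new group. Input positions are assumed to be distinct.
    """
    groups: List[List[int]] = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_mnp_distance:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups
```

Under this sketch a threshold of 0 leaves every substitution on its own, while larger thresholds chain together substitutions that are each close to their neighbor.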
4.0.4.0
Highlights of this release include major performance improvements to MarkDuplicatesSpark, better sensitivity and precision in STR (short tandem repeat) contexts for Mutect2, support for a "genotype given alleles" mode in Mutect2, dbSNP support for Funcotator, and several important bug fixes to CombineGVCFs.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Full list of changes in this release:
- MarkDuplicatesSpark
  - New, optimized version of the tool with greatly improved performance and scalability (#4656)
  - Note that this tool is still marked as beta, and has a number of known issues. The current version is suitable for evaluation/profiling purposes only.
- Mutect2 improvements
  - Added a GGA (genotype given alleles) mode activated via the --genotyping-mode GENOTYPE_GIVEN_ALLELES and --alleles arguments (#4601)
  - Better sensitivity and precision in STR (short-tandem repeat) contexts (#4690)
  - New, supported Mutect2 NIO-enabled WDL that works in Firecloud (#4710)
  - Better default AF for M2 tumor-normal mode (#4690)
  - Restored explicit PASS (as opposed to empty) filter in Mutect2 (#4644)
  - Fixed Mutect2 failure for germline resource without AF (#4607)
  - Fixed a bug in the Mutect2 WDL bamout where scatters with overlapping assembly regions failed (#4613)
  - Fixed extra filtering args being deactivated in Mutect2 WDL due to typo
- CombineGVCFs: several important bug fixes
- Funcotator
  - Added dbSNP support via a new VcfFuncotationFactory. (#4593)
  - Fixed the refContext annotation. (#4605)
  - Fixed calculation of GC content to be correct. (#4608)
  - Fixes for HG38 exception and better logging. (#4563)
  - Note: only datasource releases 1.2.20180329 and later will work with this version of Funcotator
- HaplotypeCaller: Fixed a bug that caused the --comp and --input-prior arguments to not be settable by the user (#4703)
- CNNScoreVariants: Better numerical consistency between python and java, and transpose bug fix (#4652)
- CNV Tools
- SV Tools
- Added GCS (Google Cloud Storage) output support to the following tools: ApplyBQSR, SplitNCigarReads, ClipReads, LeftAlignIndels, RevertBaseQualityScores, and UnmarkDuplicates (#4695) (#4424)
- Mark the --disable-tool-default-read-filters argument as advanced, and add a warning to its documentation string (#4671)
  - Many tools do not function correctly without their default read filters turned on, so this argument is intended only for advanced users who know what they're doing!
- ParallelCopyGCSDirectoryIntoHDFSSpark: allow the tool to take a filename glob to subset files to copy (#4624)
- Picard: updated to version 2.18.2 (#4676)
4.0.3.0
This release brings a major update to our experimental neural-network-based VariantRecalibrator replacement, initial MAF support in Funcotator, as well as some updates to Mutect2 and the CNV tools.
As usual, a docker image for this release can be downloaded from https://hub.docker.com/r/broadinstitute/gatk/
Summary of changes in this release:
- A major update to our experimental neural-network-based suite of variant scoring tools, which will eventually replace the VariantRecalibrator (#4245)
  - The NeuralNetInferenceTool has been renamed to CNNScoreVariants
  - Baseline models are now included in the distribution.
  - Added additional tools to write tensors and to train your own models given a VCF of validated calls, an unfiltered VCF and a confident region: CNNVariantTrain, CNNVariantWriteTensors and FilterVariantTranches
  - Read-level 2D models are now supported via the tensor-type read_tensor argument. 2D models at present are significantly slower than the 1D models.
- Funcotator:
  - Added prototype support for outputting MAF files (and many bug fixes) (#4472)
- Mutect2:
- CNV tools:
  - Replaced CollectFragmentCounts with CollectReadCounts. (#4564)
  - Allowed use of zero eigensamples in DenoiseReadCounts. (#4411)
  - Changed filtering of normal hets on overlap with copy-ratio intervals in ModelSegments to be consistent with filtering of case hets. (#4510)
  - Updated PostprocessGermlineCNVCalls (segments VCF writing, WDL scripts, unit tests, integration tests) (#4396)
- Miscellaneous changes:
  - Concordance: added option to analyze contributions of different filters (#4520)
  - Exposed the -pairHMM/--pair-hmm-implementation argument in HaplotypeCaller, which was previously hidden (#4494)
  - Set the default samjdk.compression_level to 2 (was previously 1) (#4547)
  - Upgraded to Spark 2.2.0 (#4314)
  - Changed Spark sharding of queryname-sorted bams to better handle secondary and supplementary reads (#4473)
  - Added logging output to the bam writing step for spark tools (#4501)
  - git-lfs is now required to compile the GATK
  - Added a registry for deprecated/unported tools. (#4505)
  - Updated the Hadoop GCS connector from 1.6.1 to 1.6.3. (#4590)
  - Added a large runtime resource directory to git-lfs, and exposed it to the Docker build. (#4530)
  - We now include full tool documentation in the GATK binary distribution zip (#4377)
  - Made our maven artifacts much smaller by preventing gradle uploadArchives from including distZip and distTar (#4569)
  - Added chr20 and chr21 alt contigs to the GRCh38 reference snippet used for testing (#4548)