major performance improvements, bugfixes

@brentp brentp released this 10 Oct 18:05
· 131 commits to master since this release

The main change in this release is the use of bit-vectors to calculate all-vs-all relatedness. This speeds up the relatedness step by about 100X, so that we can calculate relatedness for all 4,825,171 possible pairwise combinations of the 2,504 Thousand Genomes samples in about 20 seconds.
It also fixes a bug in the a-allele/b-allele designation for VCF that caused problems when comparing samples extracted from VCF/BCF to those from CRAM/BAM.
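
To illustrate why bit-vectors make pairwise comparison fast, here is a minimal Python sketch; it is not somalier's actual code. It assumes each sample's genotypes are packed into three bit-vectors (hom-ref, het, hom-alt), so that counts such as IBS0 and shared hets reduce to a bitwise AND followed by a popcount. The function names and the relatedness formula, (shared hets - 2 * IBS0) / min(hets), are assumptions made for this example and are not stated in these notes.

```python
def to_bitvectors(genotypes):
    """Pack alt-allele counts (0, 1, 2, or None for unknown) into three ints.

    Bit i of each returned int is set when site i has that genotype.
    (Hypothetical helper for illustration only.)
    """
    hom_ref = het = hom_alt = 0
    for i, g in enumerate(genotypes):
        if g == 0:
            hom_ref |= 1 << i
        elif g == 1:
            het |= 1 << i
        elif g == 2:
            hom_alt |= 1 << i
    return hom_ref, het, hom_alt


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def compare(a, b):
    """Compare two samples given as (hom_ref, het, hom_alt) bit-vectors.

    Returns (ibs0, shared_hets, relatedness). The relatedness formula is an
    assumption for this sketch: (shared_hets - 2 * ibs0) / min(hets_a, hets_b).
    """
    a_ref, a_het, a_alt = a
    b_ref, b_het, b_alt = b
    # Only consider sites with a known genotype in both samples.
    both_called = (a_ref | a_het | a_alt) & (b_ref | b_het | b_alt)
    # IBS0: one sample is hom-ref while the other is hom-alt.
    ibs0 = popcount((a_ref & b_alt) | (a_alt & b_ref))
    shared_hets = popcount(a_het & b_het)
    hets_a = popcount(a_het & both_called)
    hets_b = popcount(b_het & both_called)
    denom = min(hets_a, hets_b)
    relatedness = (shared_hets - 2 * ibs0) / denom if denom else float("nan")
    return ibs0, shared_hets, relatedness


sample_a = to_bitvectors([0, 1, 2, 1, 0, 1])
sample_b = to_bitvectors([2, 1, 0, None])
print(compare(sample_a, sample_a))
print(compare(sample_a, sample_b))
```

Each comparison touches a handful of integers per pair instead of looping over sites, which is the general reason a bit-vector representation helps when the number of pairs grows quadratically with the number of samples.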

The readme now includes instructions on how to estimate ancestry from somalier sketches.

v0.2.3

  • calculate relatedness correctly for samples with parent-ids specified
    when the parents are not actually in the pedigree file.
  • use bit-vectors to calculate relatedness. this gives up to a 250X speedup;
    with this code, I can now evaluate relatedness for 3756 samples in under 30 seconds on my laptop.
  • better scaling of X and Y depth
  • use final RG as the sample id in relate
  • output expected relatedness in .pairs.tsv file
  • fix ref/alt (a/b-allele) ordering for VCF. this was a bug that caused problems when comparing
    samples extracted from VCF files to other samples extracted from BAM/CRAM files. Thanks very
    much to Filipe and Sergio for finding this issue and providing several test-cases. (if you
    have previously downloaded the Thousand Genomes files from Zenodo, please update to the latest.)

sites files

These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.

sites.hg19.vcf.gz
sites.hg38.nochr.vcf.gz
sites.GRCh37.vcf.gz
sites.hg38.vcf.gz