Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: a new GenotypesPLINKTR class for reading TRs from PGEN files #222

Merged
merged 42 commits into from
Sep 17, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
296e75d
update Pgenlib to >=0.90.1
aryarm Jul 19, 2023
5b24aa9
Here it's the test_write_multiallelic()
gonzalogc1 Jul 24, 2023
d0fb106
I add multiallelic variants support for genotypePLINK
gonzalogc1 Jul 26, 2023
b8cbffc
create skeleton for GenotypesPLINKTR class
aryarm Jul 26, 2023
a4a9584
add plink2 TR test files
aryarm Jul 26, 2023
5f7ab28
clean up tests in test_data.py
aryarm Jul 26, 2023
f07751a
add TestGenotypesPLINKTR._fake_genotypes_multiallelic() dummy method
aryarm Jul 26, 2023
449c554
create new simple-tr-valid VCF and PGEN files
aryarm Jul 27, 2023
fd6b318
leverage new dummy method in TestGenotypesPLINK
aryarm Jul 27, 2023
7f1db2d
I just added test_read_plinktr and test_iter in the TestGenotypesPLIN…
gonzalogc1 Jul 28, 2023
5b3edfb
Added the load() method in the GenotypesPLINKTR class
gonzalogc1 Aug 3, 2023
65d8bcf
Add the following methods: read(), _iterate(), write(), write_variant…
gonzalogc1 Aug 7, 2023
f444ab2
handle missing and unphased genotypes in plinkgenotypestr
aryarm Aug 8, 2023
6b2c0c1
Start benchmarking of GenotypesPLINKTR
gonzalogc1 Aug 9, 2023
3d98d08
add benchmarking of TRs to bench_genotypes.py
aryarm Aug 10, 2023
47cdf13
fix progress bar in bench_genotypes.py
aryarm Aug 11, 2023
939353e
avoid using GetLengthGenotypes for speedup
aryarm Aug 11, 2023
f42acff
extract TRRecords iteration to a separate method
aryarm Aug 11, 2023
ef440d6
fix some small syntax things
aryarm Aug 11, 2023
8b8fcff
improve scaling of GenotypesPLINKTR as we scale the number of variants
aryarm Aug 11, 2023
e9137f5
fix failing test for GenotypesPLINKTR._iterate
aryarm Aug 11, 2023
b7973db
use numpy array method of cyvcf2 for slight speedup
aryarm Aug 13, 2023
8bdd4c7
ensure all imports are relative within haptools/
aryarm Aug 13, 2023
14be007
finish writing __iter__() and write() TR tests
aryarm Aug 13, 2023
9253bb0
add edge case from plink2-users discussion
aryarm Aug 14, 2023
436878f
transpose phasing before concatenating dummy data
aryarm Aug 14, 2023
8a8a110
rename simple-tr-valid test files to simple-tr
aryarm Aug 14, 2023
61e7163
handle allele_cts parameter when writing PGEN files
aryarm Aug 15, 2023
a32e1c3
docs: the new GenotypesPLINKTR class
aryarm Aug 15, 2023
8349689
add support for TR PGENs in clump and sim_phenotype
aryarm Aug 15, 2023
ec294bc
add support for multiallelic PGENs in sim_genotype
aryarm Aug 15, 2023
02cbc72
try and fail to speed up GenotypesPLINKTR.read()
aryarm Aug 31, 2023
b8f2c45
Revert "try and fail to speed up GenotypesPLINKTR.read()"
aryarm Aug 31, 2023
394f5ce
remove gxx requirement now that pgenlib is published with wheels
aryarm Sep 15, 2023
d97c757
refmt with black
aryarm Sep 15, 2023
b6bf7ec
do not copy alleles from list to tuple
aryarm Sep 15, 2023
f97a5e8
Merge branch 'feat/genotypesplinktr' of github.com:CAST-genomics/hapt…
aryarm Sep 15, 2023
1edea76
handle gts data arrays with dtype np.bool_ when counting uniq vals
aryarm Sep 15, 2023
e13d6f2
test multiallelic pgen reading/writing
aryarm Sep 16, 2023
7aab3c0
make sure we can write bool gts too
aryarm Sep 16, 2023
f399d8e
refmt with black
aryarm Sep 16, 2023
1835c04
Merge branch 'main' into feat/genotypesplinktr
aryarm Sep 17, 2023
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 24 additions & 1 deletion docs/api/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -168,7 +168,7 @@ All of the other methods in the :class:`Genotypes` class are inherited, but the

GenotypesTR
++++++++++++
The :class:`GenotypesTR` class *extends* :class:`Genotypes` class. The :class:`GenotypesTR` class follows the same structure of :class:`GenotypesVCF`, but can now load repeat count of tandem repeats as the alleles.
The :class:`GenotypesTR` class *extends* the :class:`Genotypes` class. The :class:`GenotypesTR` class follows the same structure of :class:`GenotypesVCF`, but can now load repeat counts of tandem repeats as the alleles.

All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesTR` class' ``load()`` function is unique to loading tandem repeat variants.

Expand All @@ -178,6 +178,11 @@ All of the other methods in the :class:`Genotypes` class are inherited, but the
# make the first sample have 4 and 7 repeats for the alleles of the fourth variant
genotypes.data[0, 3] = (4, 7)

The following methods from the :class:`Genotypes` class are disabled, however.

1. ``check_biallelic``
2. ``check_maf``

.. _api-data-genotypestr:

GenotypesPLINK
Expand Down Expand Up @@ -239,6 +244,24 @@ A large ``chunk_size`` is more likely to result in memory over-use while a small
genotypes = data.GenotypesPLINK('tests/data/simple.pgen', chunk_size=500)
genotypes.read()

GenotypesPLINKTR
++++++++++++++++
The :class:`GenotypesPLINKTR`` class extends the :class:`GenotypesPLINK` class to support loading tandem repeat variants.
The :class:`GenotypesPLINKTR` class works similarly to :class:`GenotypesTR` by filling the ``data`` property with repeat counts for each allele.

The following methods from the :class:`GenotypesPLINK` class are disabled, however.

1. ``write``
2. ``check_maf``
3. ``write_variants``
4. ``check_biallelic``

The :class:`GenotypesPLINKTR` uses INFO fields from the PVAR file to determine the repeat unit and the number of repeats for each allele. To ensure your PVAR file contains the necessary information, use the following command when converting from VCF.

.. code-block:: bash

plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --make-pgen --out output

haplotypes.py
~~~~~~~~~~~~~
Overview
Expand Down
19 changes: 13 additions & 6 deletions docs/formats/genotypes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,14 +9,14 @@ Genotypes
The time required to load various genotype file formats.

VCF/BCF
-------
~~~~~~~

Genotype files must be specified as VCF or BCF files. They can be bgzip-compressed.

.. _formats-genotypesplink:

PLINK2 PGEN
-----------
~~~~~~~~~~~

There is also experimental support for `PLINK2 PGEN <https://github.com/chrchang/plink-ng/blob/master/pgen_spec/pgen_spec.pdf>`_ files in some commands. These files can be loaded and created much more quickly than VCFs, so we highly recommend using them if you're working with large datasets. See the documentation for the :class:`GenotypesPLINK` class in :ref:`the API docs <api-data-genotypesplink>` for more information.

Expand All @@ -29,9 +29,16 @@ If you run out memory when using PGEN files, consider reading/writing variants f

pip install haptools[files]

.. warning::
At the moment, only biallelic SNPs can be encoded in PGEN files because of limitations in the ``Pgenlib`` python library. It doesn't properly support multiallelic variants yet (`source <https://github.com/chrchang/plink-ng/blob/c4b8d4361de74c58f0cc11361062eca4f34210d3/2.0/Python/python_api.txt#L88-L89>`_). To ensure your PGEN files only contain SNPs, we recommend use the following command to convert from VCF to PGEN.
Converting from VCF to PGEN
---------------------------
To convert a VCF containing only biallelic SNPs to PGEN, use the following command.

.. code-block:: bash
.. code-block:: bash

plink2 --snps-only 'just-acgt' --max-alleles 2 --vcf input.vcf --make-pgen --out output

To convert a VCF containing tandem repeats to PGEN, use this command, instead.

.. code-block:: bash

plink2 --snps-only 'just-acgt' --max-alleles 2 --vcf input.vcf --make-pgen --out output
plink2 --vcf-half-call m --make-pgen 'pvar-cols=vcfheader,qual,filter,info' --vcf input.vcf --make-pgen --out output
2 changes: 1 addition & 1 deletion docs/project_info/contributing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ Follow these steps to set up a development environment.

.. code-block:: bash

conda create -n haptools-dev -c conda-forge 'poetry>=1.1.15' 'python=3.7' 'gxx_linux-64'
conda create -n haptools-dev -c conda-forge 'poetry>=1.1.15' 'python=3.7'
2. Activate the environment

.. code-block:: bash
Expand Down
11 changes: 6 additions & 5 deletions haptools/clump.py
Original file line number Diff line number Diff line change
@@ -1,13 +1,12 @@
#!/usr/bin/env python

# To test: ./clumpSTR.py --summstats-snps tests/eur_gwas_pvalue_chr19.LDL.glm.linear --clump-snp-field ID --clump-field p-value --clump-chrom-field CHROM --clump-pos-field position --clump-p1 0.2 --out test.clump
import sys
import math
from logging import Logger, getLogger

import numpy as np

from .data import Genotypes, GenotypesVCF, GenotypesTR
from .data import Genotypes, GenotypesVCF, GenotypesTR, GenotypesPLINKTR


class Variant:
Expand Down Expand Up @@ -557,15 +556,17 @@ def clumpstr(
strgts = None
gts = None
if gts_snps:
log.debug("Loading SNP Genotypes.")
if str(gts_snps).endswith("pgen"):
log.debug("Loading SNP Genotypes.")
snpgts = GenotypesPLINK.load(gts_snps)
else:
log.debug("Loading SNP Genotypes.")
snpgts = GenotypesVCF.load(gts_snps)
if gts_strs:
log.debug("Loading STR Genotypes.")
strgts = GenotypesTR.load(gts_strs)
if str(gts_strs).endswith("pgen"):
strgts = GenotypesPLINKTR.load(gts_strs)
else:
strgts = GenotypesTR.load(gts_strs)

if gts_snps and gts_strs:
log.debug("Calculating set of overlapping samples between STRs and SNPs.")
Expand Down
8 changes: 7 additions & 1 deletion haptools/data/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,10 @@
from .covariates import Covariates
from .breakpoints import Breakpoints, HapBlock
from .haplotypes import Extra, Repeat, Variant, Haplotype, Haplotypes
from .genotypes import Genotypes, GenotypesVCF, GenotypesTR, GenotypesPLINK
from .genotypes import (
Genotypes,
GenotypesVCF,
GenotypesTR,
GenotypesPLINK,
GenotypesPLINKTR,
)
Loading
Loading