feat: a new GenotypesPLINKTR class for reading TRs from PGEN files #222

Merged: aryarm merged 42 commits into main from feat/genotypesplinktr on Sep 17, 2023

@aryarm (Member) commented Sep 15, 2023

resolves #73 (comment)

This code was primarily authored by our summer high school student, @gonzalogc1, but he had to leave before we could finish it off. Thanks for all of the help this summer, Gonzalo!

Overview

This PR enables reading and writing multiallelic variants from a PGEN file. It also adds a new GenotypesPLINKTR class capable of reading TRs from PGEN files similar to the GenotypesTR class.

Note: A newer version of Pgenlib is required as a result of this change. The pyproject.toml file has been updated accordingly.

Usage and Documentation

The interface for the GenotypesPLINK class has not changed. It works the same as before, except that it no longer raises an error for multiallelic variants. This functionality also carries over to the simgenotype command.
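To make the term "multiallelic" concrete, here is a minimal sketch in plain Python. It does not use haptools or Pgenlib; the genotype layout (one pair of allele indices per sample, with 0 as REF) and the `is_multiallelic` helper are assumptions made purely for illustration.

```python
from typing import List, Tuple

# Hypothetical in-memory layout: one (allele_a, allele_b) pair per sample,
# where 0 is REF and 1, 2, ... index the ALT alleles.
Genotype = Tuple[int, int]


def is_multiallelic(genotypes: List[Genotype]) -> bool:
    """Return True if any sample carries an allele index above 1."""
    return any(allele > 1 for pair in genotypes for allele in pair)


# A biallelic variant only ever uses allele indices 0 and 1
biallelic_variant = [(0, 0), (0, 1), (1, 1)]
# A multiallelic variant (such as many TRs) can use higher indices
multiallelic_variant = [(0, 2), (1, 1), (2, 3)]

print(is_multiallelic(biallelic_variant))
print(is_multiallelic(multiallelic_variant))
```

A reader that previously rejected the second kind of variant would now have to store and return allele indices above 1.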

The GenotypesPLINKTR class has the same methods as the GenotypesTR and GenotypesPLINK classes, except that a few of the methods from the GenotypesPLINK class have been disabled. Refer to the documentation for a full list of the disabled methods and a more thorough description of how to use the new class.
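One common way to disable an inherited method in a subclass is to override it so that it raises `NotImplementedError`. The sketch below shows that pattern with hypothetical `BaseReader` and `TRReader` classes. It is not the haptools implementation, and the choice of `check_biallelic` as the disabled method is only illustrative; the documentation has the real list.

```python
class BaseReader:
    """Hypothetical stand-in for a parent class (not haptools code)."""

    def read(self) -> str:
        return "genotypes loaded"

    def check_biallelic(self) -> bool:
        return True


class TRReader(BaseReader):
    """Hypothetical TR subclass that disables one parent method."""

    def check_biallelic(self) -> bool:
        # Override the parent method so that callers get a clear error
        raise NotImplementedError("check_biallelic() is disabled for TRs")


reader = TRReader()
print(reader.read())  # inherited methods that are not overridden still work
try:
    reader.check_biallelic()
except NotImplementedError as err:
    print(f"disabled: {err}")
```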

Tests

We added new tests and a new test data file for the GenotypesPLINK and GenotypesPLINKTR classes. The tests/data/simple-tr.pgen set of files is similar to the tests/data/simple_tr.pgen set of files, except that some alleles have been removed so that the VCF file passes the checks implemented by plink2 when converting from VCF to PGEN.
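The description does not say which alleles were removed or how. As a rough illustration, assuming the removed alleles were ALT alleles that no sample carries, the cleanup could look like the sketch below. The `drop_unused_alts` helper and the example alleles are hypothetical and were not taken from the test data.

```python
from typing import List, Tuple


def drop_unused_alts(
    alleles: List[str], genotypes: List[Tuple[int, int]]
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Remove alleles that no genotype refers to and renumber the rest.

    Index 0 (REF) is always kept, even if no sample carries it.
    """
    used = sorted({idx for pair in genotypes for idx in pair} | {0})
    # Map each surviving old index to its new, compacted index
    remap = {old: new for new, old in enumerate(used)}
    new_alleles = [alleles[idx] for idx in used]
    new_genotypes = [(remap[a], remap[b]) for a, b in genotypes]
    return new_alleles, new_genotypes


# REF followed by three ALT alleles; allele index 2 is never used
alleles = ["ACAC", "AC", "ACACAC", "ACACACAC"]
genotypes = [(0, 1), (3, 3), (1, 0)]

new_alleles, new_genotypes = drop_unused_alts(alleles, genotypes)
print(new_alleles)
print(new_genotypes)
```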

Future work

  • feat: a dosage command for storing TR motif counts in PGEN files #221
  • Try to improve TR loading speed by using TRHarmonizer objects in GenotypesPLINKTR.read_variants(). Those objects can then be reused by GenotypesPLINKTR.read(). This will require a larger refactor that is out of scope for this PR.
  • Now that wheels are being published for Pgenlib, we can make it a required (instead of an optional) dependency of haptools.
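For context on the last item: code that relies on an optional dependency usually has to check for it at runtime, and making the dependency required removes that check. The sketch below shows one generic way to write such a guard with the standard library. The `has_module` and `require` helpers are hypothetical and are not how haptools handles Pgenlib.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the named top-level module can be found."""
    return importlib.util.find_spec(name) is not None


def require(name: str) -> None:
    """Raise a descriptive error if an optional dependency is missing."""
    if not has_module(name):
        raise ImportError(f"{name} is needed for this feature; install it first")


# "json" ships with Python, so this check succeeds
print(has_module("json"))
require("json")
```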

aryarm and others added 30 commits July 19, 2023 12:58
also, add test for reading TRs in TestGenotypesPLINK
…KTR class of the test/test_data.py file, also I already added the __init__ function in the GenotypesPLINKTR class in genotype.py in the data file
…s(), check_biallelic(), and check_maf() for GenotypesPLINKTR
by using broadcasting across the number of variants
where there is an extra ALT allele that isn't used within any of the genotypes
also explain how to convert TR VCFs to PGEN
@aryarm aryarm marked this pull request as draft September 15, 2023 20:49
@aryarm aryarm marked this pull request as ready for review September 15, 2023 21:27
@aryarm aryarm changed the title from "feat: a new GenotypesPlinkTR class for reading TRs from PGEN files" to "feat: a new GenotypesPLINKTR class for reading TRs from PGEN files" Sep 15, 2023
haptools/data/genotypes.py (review thread resolved)
tests/test_data.py (outdated review thread, resolved)
@mlamkin7 (Collaborator) left a comment

All good.

@aryarm aryarm merged commit 3c7abe6 into main Sep 17, 2023
8 checks passed
@aryarm aryarm deleted the feat/genotypesplinktr branch October 4, 2023 17:53
Development

Successfully merging this pull request may close these issues.

support for a TR-based GenotypesPLINK class
3 participants