Skip to content

Commit

Permalink
feat: Clump (#211)
Browse files Browse the repository at this point in the history
  • Loading branch information
mlamkin7 authored May 2, 2023
1 parent 9e2ffe1 commit 3740ec1
Show file tree
Hide file tree
Showing 11 changed files with 1,447 additions and 1 deletion.
8 changes: 8 additions & 0 deletions docs/api/haptools.rst
Original file line number Diff line number Diff line change
Expand Up @@ -129,3 +129,11 @@ haptools.index module
:members:
:undoc-members:
:show-inheritance:

haptools.clump module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: haptools.clump
:members:
:undoc-members:
:show-inheritance:
108 changes: 108 additions & 0 deletions docs/commands/clump.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,108 @@
.. _commands-clump:


clump
=====

Clump a set of variants specified as a :doc:`.hap file </formats/genotypes>`.

The ``clump`` command creates a clump file joining SNPs or STRs in LD with one another.

Usage
~~~~~
.. code-block:: bash
haptools clump \
--verbosity [CRITICAL|ERROR|WARNING|INFO|DEBUG|NOTSET] \
--summstats-snps PATH \
--gts-snps PATH \
--summstats-strs PATH \
--gts-strs PATH \
--clump-field TEXT \
--clump-id-field TEXT \
--clump-chrom-field TEXT \
--clump-pos-field TEXT \
--clump-p1 FLOAT \
--clump-p2 FLOAT \
--clump-r2 FLOAT \
--clump-kb FLOAT \
--ld [Exact|Pearson] \
--out PATH
Examples
~~~~~~~~
.. code-block:: bash
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--out test_snps.clump
You can use ``--ld [Exact|Pearson]`` to indicate which type of LD calculation you'd like to perform. ``Exact`` utilizes an exact cubic solution adopted from `CubeX <https://github.com/t0mrg/cubex>`_ whereas ``Pearson`` utilizes a Pearson R calculation. Note ``Exact`` only works on SNPs and not any other variant type eg. STRs. The default value is ``Pearson``.

.. code-block:: bash
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump
You can modify thresholds and values used in the clumping process. ``--clump-p1`` is the largest value of a p-value to consider being an index variant for a clump. ``--clump-p2`` dictates the maximum p-value any variant can have to be considered when clumping. ``--clump-r2`` is the R squared threshold where being greater than this value implies the candidate variant is in LD with the index variant. ``--clump-kb`` is the maximum distance upstream or downstream from the index variant to collect candidate variants for LD comparison. For example, ``--clump-kb 100`` implies all variants 100 Kb upstream and 100 Kb downstream from the variant will be considered.

.. code-block:: bash
haptools clump \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--clump-p1 0.001 \
--clump-p2 0.05 \
--clump-r2 0.7 \
--clump-kb 200.5 \
--out test_snps.clump
You can also input STRs when calculating clumps. They can be used together with SNPs or alone.

.. code-block:: bash
haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--summstats-snps tests/data/test_snpstats.linear \
--gts-snps tests/data/simple.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump
.. code-block:: bash
haptools clump \
--summstats-strs tests/data/test_strstats.linear \
--gts-strs tests/data/simple_tr.vcf \
--clump-id-field ID \
--clump-chrom-field CHROM \
--clump-pos-field POS \
--ld Exact \
--out test_snps.clump
All files used in these examples are described :doc:`here </project_info/example_files>`.


Detailed Usage
~~~~~~~~~~~~~~

.. click:: haptools.__main__:main
:prog: haptools
:show-nested:
:commands: clump
3 changes: 3 additions & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ Commands

* :doc:`haptools index </commands/index>`: Sort, compress, and index our custom file format for haplotypes.

* :doc:`haptools clump </commands/clump>`: Convert variants in LD with one another into clumps.

* :doc:`haptools ld </commands/ld>`: Compute Pearson's correlation coefficient between a target haplotype and a set of haplotypes.

.. figure:: https://drive.google.com/uc?id=1c0i_Hjms7579s24zRsKp5yMs7BxNHed_
Expand Down Expand Up @@ -91,6 +93,7 @@ There is an option to *Cite this repository* on the right sidebar of `the reposi
commands/karyogram.rst
commands/transform.rst
commands/index.rst
commands/clump.rst
commands/ld.rst

.. toctree::
Expand Down
124 changes: 124 additions & 0 deletions haptools/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -883,6 +883,130 @@ def index(
index_haps(haplotypes, sort, output, log)


@main.command(short_help="Clump summary stat files.")
@click.option(
"--summstats-snps",
type=click.Path(path_type=Path),
help="File to load snps summary statistics",
)
@click.option(
"--summstats-strs",
type=click.Path(path_type=Path),
help="File to load strs summary statistics",
)
@click.option(
"--gts-snps", type=click.Path(path_type=Path), help="SNP genotypes (VCF or PGEN)"
)
@click.option("--gts-strs", type=click.Path(path_type=Path), help="STR genotypes (VCF)")
@click.option(
"--clump-p1", type=float, default=0.0001, help="Max pval to start a new clump"
)
@click.option(
"--clump-p2", type=float, default=0.01, help="Filter for pvalue less than"
)
@click.option(
"--clump-id-field", type=str, default="SNP", help="Column header of the variant ID"
)
@click.option(
"--clump-field", type=str, default="P", help="Column header of the p-values"
)
@click.option(
"--clump-chrom-field",
type=str,
default="CHR",
help="Column header of the chromosome",
)
@click.option(
"--clump-pos-field", type=str, default="POS", help="Column header of the position"
)
@click.option(
"--clump-kb",
type=float,
default=250,
help="clump kb radius",
)
@click.option(
"--clump-r2",
type=float,
default=0.5,
help="r^2 threshold",
)
@click.option(
"--ld",
type=click.Choice(["Exact", "Pearson"]),
default="Pearson",
show_default=True,
help=(
"Calculation type to infer LD, Exact Solution or "
"Pearson R. (Exact|Pearson). Note the Exact Solution "
"works best when all three genotypes are present (0,1,2) in "
"the variants being compared."
),
)
@click.option(
"--out",
type=click.Path(path_type=Path),
required=True,
help="Output filename",
)
@click.option(
"-v",
"--verbosity",
type=click.Choice(["CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", "NOTSET"]),
default="INFO",
show_default=True,
help="The level of verbosity desired",
)
def clump(
summstats_snps: Path,
summstats_strs: Path,
gts_snps: Path,
gts_strs: Path,
clump_p1: float,
clump_p2: float,
clump_id_field: str,
clump_field: str,
clump_chrom_field: str,
clump_pos_field: str,
clump_kb: float,
clump_r2: float,
ld: str,
out: Path,
verbosity: str = "CRITICAL",
):
"""
Performs clumping on datasets with SNPs, SNPs and STRs, and STRs.
Clumping is the process of identifying SNPs or STRs that are highly
correlated with one another and concatenating them all together into
a single "clump" in order to not repeat the same effect size due to
LD.
"""
from .logging import getLogger
from .clump import clumpstr

log = getLogger(name="clump", level=verbosity)
log.debug(f"Loading SNPs from {summstats_snps} {gts_snps}")
log.debug(f"Loading STRs from {summstats_strs} {gts_strs}")

clumpstr(
summstats_snps,
summstats_strs,
gts_snps,
gts_strs,
clump_p1,
clump_p2,
clump_id_field,
clump_field,
clump_chrom_field,
clump_pos_field,
clump_kb,
clump_r2,
ld,
out,
log,
)


if __name__ == "__main__":
# run the CLI if someone tries 'python -m haptools' on the command line
main(prog_name="haptools")
Loading

0 comments on commit 3740ec1

Please sign in to comment.