
Local alleles - Add LPL field #273

Merged (12 commits merged into sgkit-dev:main on Jul 17, 2024)

Conversation

@Will-Tyler (Contributor) commented Jul 12, 2024

Overview

In this pull request, vcf2zarr computes a genotype-level local-alleles field, LPL, during the explode step unless specifically told not to. The LPL field is related to the LAA field. The LAA field is a list of one-based indices into the variant-level ALT field that indicate which alleles are relevant (local) for the current sample. The LPL field is a list of the Phred-scaled genotype likelihoods for the genotypes formed from the reference allele and the alleles given by the LAA field. The source of truth for the LPL field is the LAA and PL fields. The PL field is a list of the Phred-scaled genotype likelihoods for all possible genotypes in the variant.
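
For illustration, here is a minimal sketch of that relationship for a diploid call. The diploid_lpl helper is hypothetical (it is not the vcf2zarr implementation) and it ignores missing PL values:

from itertools import combinations_with_replacement

def diploid_lpl(pl, laa):
    # Local alleles: the reference allele (index 0) plus the one-based ALT
    # indices listed in LAA.
    local_alleles = [0] + list(laa)
    lpl = []
    for a, b in combinations_with_replacement(local_alleles, 2):
        j, k = min(a, b), max(a, b)
        # VCF stores the likelihood of the diploid genotype (j, k), j <= k,
        # at index k * (k + 1) // 2 + j of the PL field.
        lpl.append(pl[k * (k + 1) // 2 + j])
    return lpl

# The POS 3356 record in the test data below (ALT=C,T; LAA=2;
# PL=10,20,30,40,50,60) covers local alleles G (REF) and T, so LPL holds
# the likelihoods of G/G, G/T and T/T.
print(diploid_lpl([10, 20, 30, 40, 50, 60], [2]))  # [10, 40, 60]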

This pull request makes progress on #185. What remains is to prevent vcf2zarr from creating and storing the PL field when the local-allele fields are available.

Testing

I updated existing unit tests and added new ones based on the unit tests for the LAA field.

Here is the data from local_alleles.vcf.gz:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 01 02
chr1 3331 . T G . . . GT:DP:GQ:AD:PL 0/1:10:99:6,4:100,0,105 0/0:9:99:9,0:0,100,200
chr1 3335 . G C . . . GT:DP:GQ:AD:PL 0/0:11:99:11,0:0,100,200 1/1:7:22:0,7:154,22,0
chr1 3350 . A C,T . . . GT:DP:GQ:AD:PL 1/1:15:55:0,15,0:1002,55,1002,0,55,1002 0/2:12:99:7,0,5:154,154,0,154,102,102
chr1 3351 . A G,T,C . . . GT:LAA:AD 0/0:1,2,3:0,0,0,0 0/0:2,3:0,0,0,0
chr1 3352 . A G . . . GT:PL 0/0:30,30,30 0/0:30,60,0
chr1 3353 . A G,T,C . . . GT:PL 0:0,0,30,0 1:0,0,0,0
chr1 3354 . A G . . . GT:LAA:LPL:PL 0/0:1:30,30,30:40,40,40 0/0::30,30,30:40,40,40
chr1 3356 . G C,T . . . GT:LAA:PL 0/0:2:10,20,30,40,50,60 1/1:2:10,20,30,40,50,60
chr1 3357 . A G . . . GT:PL 0/0:30,30,30 0/0:30,.,0
chr1 3358 . A G . . . GT:PL 0/0:. 0/0:.
chr1 3359 . A G . . . GT:AD 0:20,10 1:0,20

These should cover the following cases:

  • LAA field absent/present,
  • LPL field absent/present,
  • PL field absent/present,
  • some, but not all, PL values missing,
  • monoploid and diploid.

References

@jeromekelleher (Contributor)

I think the test failures are related to #258 - the specific numbers we're checking against here depend on the version of numpy. Looks like we need to explicitly pin a numpy version somewhere.

@jeromekelleher (Contributor) left a comment

Good progress, thanks @Will-Tyler. I've left a few minor comments.

It would be good to have a source of external truth here for computing these values. It's a pity that the bcftools thing didn't work out.

@@ -475,13 +515,18 @@ def test_ok_without_local_alleles(self, ds):

class Test1000G2020Example:
data_path = "tests/data/vcf/1kg_2020_chrM.vcf.gz"
expected_call_lpl_path = "tests/data/numpy/expected_call_LPL.npy"
Contributor

I'm not so sure about how well this works as a testing strategy. I think testing against pre-computed values like this works fine when it's for small examples and you can see them in the code. But here, as a code reader, I don't know where this array comes from and it doesn't give me any reassurance that the values have been computed correctly.

I think a better approach would be to keep a copy of the 1kg_2020_chrM.vcf file that has these LPL values computed by some other method. Is there any way we can do this, other than using Hail? (Which would be a faff, I guess)

Contributor Author

I put the expected values in a file because the ruff formatter was reformatting the expected values into a giant mess. I will put the expected values back into the Python code and try disabling the ruff linter/formatter for the expected values. If we don't like that, maybe we can consider removing the test case. I'm not aware of any other ways to compute the local-allele fields. I checked bcftools again: they updated the tag2tag plugin yesterday, but it still does not support XX to LXX conversions.

I did manually inspect the expected values to make sure they all looked correct against the test dataset. Even if the test takes some manual effort to verify the first time, I think it is still valuable for detecting changes.

[154, 22, 0, -2, -2, -2, -2, -2, -2, -2],
],
[
[1002, 55, 1002, 0, 55, 1002, -2, -2, -2, -2],
Contributor

The dimensions don't look right to me - why do we have so much padding? Shouldn't the inner dimension be 6 here, not 10? (That's the main point of the exercise here, to reduce the size of the last dimension in PL fields)

Contributor Author

The inner dimension is 10 because at POS 3351, the first sample has a diploid genotype and three alternate allele indices. (I have pasted the test data in this pull request's description for easy reference.) Therefore, the LPL field should have ten entries, one for each possible genotype from the alleles, including the reference allele. In this case, all of the values are the missing value because the entry does not have a PL field.

I think that if we reduced the size by truncating the data, it would be more complicated for data consumers. We also need to support the case where some, but not all, of the PL values are missing, right? If so, I can add a test case for this.
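
For reference, the number of unordered genotypes for n alleles at ploidy p is C(n + p - 1, p), which is where the inner dimension of 10 comes from; a small sketch (the n_genotypes helper is hypothetical, not code from this PR):

from math import comb

def n_genotypes(n_alleles, ploidy):
    # Unordered genotypes that can be drawn from n_alleles alleles.
    return comb(n_alleles + ploidy - 1, ploidy)

print(n_genotypes(4, 2))  # 10: REF plus LAA=1,2,3 at POS 3351, diploid
print(n_genotypes(3, 2))  # 6: a diploid site with two ALT alleles and no local restriction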

@@ -1183,6 +1247,25 @@ def process_partition(self, partition_index):
else:
format_fields.append(field)

# We need to determine LAA before LPL
Contributor

It's taken me a good while to figure this out - maybe we could express it more simply?

if "LAA" in format_fields and "LPL" in format_fields:
     laa_index = format_fields.index("LAA")
     lpl_index = format_fields.index("LPL")
     # LAA needs to come before LPL
     if lpl_index < laa_index:
          format_fields[laa_index] = "LPL"
          format_fields[lpl_index] = "LAA"

That's the same thing, right? Let's not worry about optimising to reduce the number of scans through the format_fields array.

@Will-Tyler (Contributor Author) commented Jul 16, 2024

I'm not sure that the test failures depend on the NumPy version. When I change my environment to use Python 3.10, the tests pass. When I change the environment to use Python 3.11, I get the failures seen in the CI. In both cases, I used NumPy version 1.26.0.


EDIT: I believe the breaking change is in numcodecs 0.13.0, which includes a change to the zstd compression algorithm (release notes). When I use version 0.13.0, I get the test failure. When I use version 0.12.1, the same tests pass. Version 0.13.0 was released on July 12th (source), the day that I opened this PR.
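
If the failing checks compare compressed output or checksums, a change in the bundled zstd would explain this; a minimal sketch under that assumption (not code from this PR):

import numcodecs
import numpy as np

# Assumed reproduction: the compressed bytes that Zstd produces can change
# between numcodecs releases even though the decoded data is identical, so
# tests that compare compressed output or checksums can break on an upgrade.
codec = numcodecs.Zstd()
data = np.arange(1000, dtype=np.int32).tobytes()
encoded = codec.encode(data)
print(numcodecs.__version__, len(encoded))  # length may differ across versions
assert bytes(codec.decode(encoded)) == data  # the round-trip is unchanged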

@coveralls (Collaborator) commented Jul 16, 2024

Coverage Status: 98.911% (+0.03%) from 98.886%, when pulling 7e88757 on Will-Tyler:local-alleles into 4366ec1 on sgkit-dev:main.

@jeromekelleher (Contributor) left a comment

This looks great, thanks @Will-Tyler! I think I need to play with this a bit now and see how it goes on some real datasets.

I think we just need to drop the pin on numcodecs (issue fixed in #275) and we're good to merge.

@Will-Tyler (Contributor Author)
Should be rebased now with the numcodecs version unpinned. I'm excited to see how well this works in practice, and I'm happy to implement any improvements you identify. Thanks for reviewing!

@jeromekelleher jeromekelleher merged commit 5105ffe into sgkit-dev:main Jul 17, 2024
11 checks passed