
Local alleles - Add LPL field #273

Merged (12 commits merged into sgkit-dev:main on Jul 17, 2024)

Conversation

@Will-Tyler (Contributor) commented Jul 12, 2024

Overview

In this pull request, vcf2zarr computes a genotype-level local-alleles field, LPL, during the explode step unless specifically told not to. The LPL field is related to the LAA field. The LAA field is a list of one-based indices into the variant-level ALT field that indicate which alleles are relevant (local) for the current sample. The LPL field is a list of the Phred-scaled genotype likelihoods for the genotypes formed from the reference allele and the alleles given by the LAA field. The source of truth for the LPL field is the LAA and PL fields. The PL field is a list of the Phred-scaled genotype likelihoods for all possible genotypes in the variant.
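
For illustration, here is a minimal sketch of that relationship for a diploid call. The diploid_lpl helper is hypothetical (it is not the vcf2zarr implementation) and it ignores missing PL values:

from itertools import combinations_with_replacement

def diploid_lpl(pl, laa):
    # Local alleles: the reference allele (index 0) plus the one-based ALT
    # indices listed in LAA.
    local_alleles = [0] + list(laa)
    lpl = []
    for a, b in combinations_with_replacement(local_alleles, 2):
        j, k = min(a, b), max(a, b)
        # VCF stores the likelihood of the diploid genotype (j, k), j <= k,
        # at index k * (k + 1) // 2 + j of the PL field.
        lpl.append(pl[k * (k + 1) // 2 + j])
    return lpl

# The POS 3356 record in the test data below (ALT=C,T; LAA=2;
# PL=10,20,30,40,50,60) covers local alleles G (REF) and T, so LPL holds
# the likelihoods of G/G, G/T and T/T.
print(diploid_lpl([10, 20, 30, 40, 50, 60], [2]))  # [10, 40, 60]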

This pull request makes progress on #185. What remains is to prevent vcf2zarr from creating and storing the PL field when the local-allele fields are available.

Testing

I updated existing unit tests and added new ones based on the unit tests for the LAA field.

Here is the data from local_alleles.vcf.gz:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 01 02
chr1 3331 . T G . . . GT:DP:GQ:AD:PL 0/1:10:99:6,4:100,0,105 0/0:9:99:9,0:0,100,200
chr1 3335 . G C . . . GT:DP:GQ:AD:PL 0/0:11:99:11,0:0,100,200 1/1:7:22:0,7:154,22,0
chr1 3350 . A C,T . . . GT:DP:GQ:AD:PL 1/1:15:55:0,15,0:1002,55,1002,0,55,1002 0/2:12:99:7,0,5:154,154,0,154,102,102
chr1 3351 . A G,T,C . . . GT:LAA:AD 0/0:1,2,3:0,0,0,0 0/0:2,3:0,0,0,0
chr1 3352 . A G . . . GT:PL 0/0:30,30,30 0/0:30,60,0
chr1 3353 . A G,T,C . . . GT:PL 0:0,0,30,0 1:0,0,0,0
chr1 3354 . A G . . . GT:LAA:LPL:PL 0/0:1:30,30,30:40,40,40 0/0::30,30,30:40,40,40
chr1 3356 . G C,T . . . GT:LAA:PL 0/0:2:10,20,30,40,50,60 1/1:2:10,20,30,40,50,60
chr1 3357 . A G . . . GT:PL 0/0:30,30,30 0/0:30,.,0
chr1 3358 . A G . . . GT:PL 0/0:. 0/0:.
chr1 3359 . A G . . . GT:AD 0:20,10 1:0,20

These should cover the following cases:

  • LAA field absent/present,
  • LPL field absent/present,
  • PL field absent/present,
  • some, but not all, PL values missing,
  • monoploid and diploid.

References

@jeromekelleher (Contributor)

I think the test failures are related to #258 - the specific numbers we're checking against here depend on the version of numpy. Looks like we need to explicitly pin a numpy version somewhere.

@jeromekelleher (Contributor) left a comment

Good progress, thanks @Will-Tyler. I've left a few minor comments.

It would be good to have a source of external truth here for computing these values. It's a pity that the bcftools thing didn't work out.

@@ -475,13 +515,18 @@ def test_ok_without_local_alleles(self, ds):

class Test1000G2020Example:
data_path = "tests/data/vcf/1kg_2020_chrM.vcf.gz"
expected_call_lpl_path = "tests/data/numpy/expected_call_LPL.npy"
Contributor

I'm not so sure about how well this works as a testing strategy. I think testing against pre-computed values like this works fine when it's for small examples and you can see them in the code. But here, as a code reader, I don't know where this array comes from and it doesn't give me any reassurance that the values have been computed correctly.

I think a better approach would be to keep a copy of the 1kg_2020_chrM.vcf file that has these LPL values computed by some other method. Is there any way we can do this, other than using Hail? (Which would be a faff, I guess)

Contributor Author

I put the expected values in a file because the ruff formatter was reformatting the expected values into a giant mess. I will put the expected values back into the Python code and try disabling the ruff linter/formatter for the expected values. If we don't like that, maybe we can consider removing the test case. I'm not aware of any other ways to compute the local-allele fields. I checked bcftools again: they updated the tag2tag plugin yesterday, but it still does not support XX to LXX conversions.

I did manually inspect the expected values to make sure they all looked correct against the test dataset. Even if the test takes some manual effort to verify the first time, I think it is still valuable for detecting changes.

[154, 22, 0, -2, -2, -2, -2, -2, -2, -2],
],
[
[1002, 55, 1002, 0, 55, 1002, -2, -2, -2, -2],
Contributor

The dimensions don't look right to me - why do we have so much padding? Shouldn't the inner dimension be 6 here, not 10? (That's the main point of the exercise here, to reduce the size of the last dimension in PL fields)

Contributor Author

The inner dimension is 10 because at POS 3351, the first sample has a diploid genotype and three alternate allele indices. (I have pasted the test data in this pull request's description for easy reference.) Therefore, the LPL field should have ten entries, one for each possible genotype from the alleles, including the reference allele. In this case, all of the values are the missing value because the entry does not have a PL field.

I think that if we reduced the size by truncating the data, it would be more complicated for data consumers. We also need to support the case where some, but not all, of the PL values are missing, right? If so, I can add a test case for this.
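
For reference, the number of unordered genotypes for n alleles at ploidy p is C(n + p - 1, p), which is where the inner dimension of 10 comes from; a small sketch (the n_genotypes helper is hypothetical, not code from this PR):

from math import comb

def n_genotypes(n_alleles, ploidy):
    # Unordered genotypes that can be drawn from n_alleles alleles.
    return comb(n_alleles + ploidy - 1, ploidy)

print(n_genotypes(4, 2))  # 10: REF plus LAA=1,2,3 at POS 3351, diploid
print(n_genotypes(3, 2))  # 6: a diploid site with two ALT alleles and no local restriction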

@@ -1183,6 +1247,25 @@ def process_partition(self, partition_index):
else:
format_fields.append(field)

# We need to determine LAA before LPL
Contributor

It's taken me a good while to figure this out - maybe we could express it more simply?

if "LAA" in format_fields and "LPL" in format_fields:
     laa_index = format_fields.index("LAA")
     lpl_index = format_fields.index("LPL")
     # LAA needs to come before LPL
     if lpl_index < laa_index:
          format_fields[laa_index] = "LPL"
          format_fields[lpl_index] = "LAA"

That's the same thing, right? Let's not worry about optimising to reduce the number of scans through the format_fields array.

@Will-Tyler (Contributor Author) commented Jul 16, 2024

I'm not sure that the test failures depend on the NumPy version. When I change my environment to use Python 3.10, the tests pass. When I change the environment to use Python 3.11, I get the failures seen in the CI. In both cases, I used NumPy version 1.26.0.


EDIT: I believe the breaking change is in numcodecs 0.13.0, which includes a change to the zstd compression algorithm (release notes). When I use version 0.13.0, I get the test failure. When I use version 0.12.1, the same tests pass. Version 0.13.0 was released on July 12th (source), the day that I opened this PR.
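
If the failing checks compare compressed output or checksums, a change in the bundled zstd would explain this; a minimal sketch under that assumption (not code from this PR):

import numcodecs
import numpy as np

# Assumed reproduction: the compressed bytes that Zstd produces can change
# between numcodecs releases even though the decoded data is identical, so
# tests that compare compressed output or checksums can break on an upgrade.
codec = numcodecs.Zstd()
data = np.arange(1000, dtype=np.int32).tobytes()
encoded = codec.encode(data)
print(numcodecs.__version__, len(encoded))  # length may differ across versions
assert bytes(codec.decode(encoded)) == data  # the round-trip is unchanged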

@coveralls (Collaborator) commented Jul 16, 2024

Coverage Status: 98.911% (+0.03%) from 98.886%, when pulling 7e88757 on Will-Tyler:local-alleles into 4366ec1 on sgkit-dev:main.

@jeromekelleher (Contributor) left a comment

This looks great, thanks @Will-Tyler! I think I need to play with this a bit now and see how it goes on some real datasets.

I think we just need to drop the pin on numcodecs (issue fixed in #275) and we're good to merge.

@Will-Tyler (Contributor Author)
Should be rebased now with the numcodecs version unpinned. I'm excited to see how well this works in practice, and I'm happy to implement any improvements you identify. Thanks for reviewing!

@jeromekelleher jeromekelleher merged commit 5105ffe into sgkit-dev:main Jul 17, 2024
11 checks passed