Add validator for writing vcf. #277

Merged 2 commits on Oct 4, 2023


niyarin
Contributor

@niyarin niyarin commented Jun 12, 2023

I added a VCF 4.4 validator for writing.
See also #271.

@niyarin niyarin requested review from athos and a team June 12, 2023 10:16
@niyarin niyarin requested a review from alumi as a code owner June 12, 2023 10:16
@niyarin niyarin requested review from matsutomo81 and removed request for a team and alumi June 12, 2023 10:16
@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch from a6160af to 131e0ad Compare June 12, 2023 10:18
@niyarin niyarin requested a review from a team as a code owner June 12, 2023 10:18
@niyarin niyarin removed the request for review from a team June 12, 2023 10:19
@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch from 131e0ad to 7114690 Compare June 12, 2023 10:21
@codecov

codecov bot commented Jun 12, 2023

Codecov Report

Merging #277 (def3c5b) into master (212b52c) will increase coverage by 0.01%.
The diff coverage is 89.18%.

@@            Coverage Diff             @@
##           master     #277      +/-   ##
==========================================
+ Coverage   88.85%   88.86%   +0.01%     
==========================================
  Files          78       79       +1     
  Lines        6512     6771     +259     
  Branches      458      475      +17     
==========================================
+ Hits         5786     6017     +231     
- Misses        268      279      +11     
- Partials      458      475      +17     
Files Coverage Δ
src/cljam/io/vcf/util/validator.clj 89.18% <89.18%> (ø)

@alumi alumi self-requested a review June 14, 2023 06:17
@alumi alumi self-assigned this Jun 14, 2023
Member

@alumi alumi left a comment

Thank you for implementing this feature! 👍 I think the basic functionality is good.
I have added a few comments as there seems to be room for improvement.

@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch from 31226a9 to 79954bf Compare July 4, 2023 02:17
@alumi alumi self-requested a review July 5, 2023 02:27
Member

@alumi alumi left a comment

Thank you for updating. It's getting much better! 👍

@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch 2 times, most recently from 08526be to ded6248 Compare July 19, 2023 05:03
@alumi alumi self-requested a review July 20, 2023 03:14
Member

@alumi alumi left a comment

Thank you for your continuous efforts in improving this! 🙏 I added some more comments.
Also, some of my previous comments were overlapped by one of them and may have been missed, so please take them into consideration again.

@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch from 0f0c36e to 533f73e Compare July 27, 2023 23:00
@alumi alumi self-requested a review July 31, 2023 02:23
Member

@athos athos left a comment

Thanks for drafting this PR and sorry for the terribly late reaction to it!

Personally, I'm thinking that the VCF validator namespace should provide functions like the following as a public API, from the aspects of consistency and extensibility:

  • (make-validator <meta-info> <header> <opts>)
  • (validate-variant <validator> <variant>)
  • (validate-variants <validator> <variants>)

Consistency

invalid-variant? (btw I'd prefer valid-variant?, but anyway) does more than return a boolean value, so it shouldn’t be defined as a predicate. Rather, I think it would be nice to align its semantics with those of validate-variants and provide functionality that validates a single variant and reports errors, if any.

Extensibility

If we add more validation options in the future, it would be tedious to pass those options to every call to validate-variant(s) (especially when they are used in more than one place). So, I think the validation options should be passed to make-validator, and validate-variant(s) should use the options specified for the validator.

Also, validate-variants could be extended to return a transducer (from the arity without variants), but it already has an optional argument vcf-or-bcf, which makes it a little bit harder to add another arity for the transducer. In this respect, I think vcf-or-bcf should be passed to make-validator as another option, such as :file-type.

We are highly likely to want to add more options for VCF validation in the future, and a validator will be a good place to put together various information including options.
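
The API shape proposed above could look roughly like the following. This is only a sketch under assumptions: the validator is modelled as a plain map, only :pos is checked, and the map keys, the error-map shape and the :vcf default are illustrative rather than taken from the PR.

```clojure
;; Hypothetical sketch: the validator is a plain map that bundles
;; meta-info, header and options (including :file-type).
(defn make-validator
  ([meta-info header] (make-validator meta-info header {}))
  ([meta-info header {:keys [file-type] :or {file-type :vcf}}]
   {:meta-info meta-info, :header header, :file-type file-type}))

(defn validate-variant
  "Returns a map of errors for `variant`, or nil if none were found.
  Only :pos is checked, to keep the sketch short."
  [_validator variant]
  (when-not (integer? (:pos variant))
    {:pos ["Position must be Integer."]}))

(defn validate-variants
  "With one argument returns a transducer; with two, a lazy seq of error maps."
  ([validator] (keep #(validate-variant validator %)))
  ([validator variants] (sequence (validate-variants validator) variants)))
```

Because the file type lives in the validator, the one-argument arity of validate-variants stays free for the transducer, e.g. `(into [] (validate-variants v) variants)`.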

@athos
Member

athos commented Aug 4, 2023

JFYI: If we really need to have a function that returns a map containing error information, I’d prefer adding another "verb" for the different semantics. For example:

  • (validate-variant <validator> <variant>): Returns a map that contains error information if any. Otherwise returns nil.
  • (check-variant <validator> <variant>): Throws an error if the variant is invalid.
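
To make the difference between the two verbs concrete, here is a minimal sketch. The single :pos check, the error-map shape and the ex-info message are assumptions for illustration only.

```clojure
;; Hypothetical sketch: validate-* returns data, check-* throws.
(defn validate-variant
  "Returns a map of errors, or nil if the variant is valid."
  [_validator variant]
  (when-not (integer? (:pos variant))
    {:pos ["Position must be Integer."]}))

(defn check-variant
  "Throws an ex-info carrying the error map if the variant is invalid.
  Returns nil otherwise."
  [validator variant]
  (when-let [errors (validate-variant validator variant)]
    (throw (ex-info "Invalid variant" {:variant variant, :errors errors}))))
```

With this split, check-variant is a thin wrapper, so both functions always agree on what counts as invalid.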

@niyarin niyarin force-pushed the feature/check-writing-valid-vcf branch from 19cfea0 to 442d467 Compare August 8, 2023 02:07
@niyarin
Contributor Author

niyarin commented Aug 8, 2023

I renamed the API and made make-validator public.

@athos
Member

athos commented Aug 10, 2023

Thanks for the updates!

And sorry that I carelessly suggested using the word "check" without paying attention to the code you already wrote, but currently the word "check" is used interchangeably with the word "validate" (like check-base-records, check-entry-type etc.). We should draw a clear line between how we use those two words.

res (if fmt
      (check-entry entry fmt
                   (count (:alt variant)))
      [(format "Key %s not in meta." id)])]
Member

The VCF spec v4.4 says:

Meta-information lines are optional, but if they are present then they must be completely well-formed.
(snip) Note that BCF, the binary counterpart of VCF, requires that all entries are present.
It is recommended to include meta-information lines describing the entries used in the body of the VCF file.

A straightforward interpretation of these lines, I think, is that the validator should not raise an error only because the definition of a format key was not found in the meta info when it validates a VCF file.
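
One way to read this, sketched under assumptions: a key missing from the meta info is reported only when the target is BCF. The function name, the argument order and the :bcf keyword are hypothetical; the error message is the one from the snippet above.

```clojure
;; Hypothetical sketch: an undefined FORMAT key is tolerated for VCF
;; and reported only for BCF, which requires all entries to be present.
(defn missing-meta-errors
  [file-type format-meta id]
  (when (and (= file-type :bcf)
             (not (contains? format-meta id)))
    [(format "Key %s not in meta." id)]))
```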

Comment on lines 142 to 143
[(format (str "Key %s is not contained in ##info fields in "
              "the header.") id)])))
Member

Ditto.


(defn- check-entry-type [entry type-str]
  ((case type-str
     "Integer" integer?
Member

@athos athos Aug 22, 2023

The VCF spec defines integers to be 32-bit and signed, and disallows the values from $-2^{31}$ to $-2^{31}+7$, so the validator should check if the integer value is within the valid range.
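
A range check along these lines might look like the sketch below. The names are hypothetical; the bounds follow directly from the comment above: the eight values from $-2^{31}$ to $-2^{31}+7$ are excluded, so the smallest valid value is $-2^{31}+8$ and the largest is $2^{31}-1$.

```clojure
;; Hypothetical sketch of a 32-bit signed range check for VCF integers.
(def min-valid-int (+ (- (bit-shift-left 1 31)) 8)) ; -2^31 + 8
(def max-valid-int (dec (bit-shift-left 1 31)))     ;  2^31 - 1

(defn valid-vcf-integer?
  "True if x is an integer within the range allowed for VCF values."
  [x]
  (and (integer? x)
       (<= min-valid-int x max-valid-int)))
```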

(:chr variant))]],
:pos [[integer? "Position must be Integer."]],
:ref [[valid-ref? "Must consist of ACGTNacgtn."]]
:alt [[(every-pred seq sequential?) "Must be a sequence."]
Member

If seq is called on a non-seqable value, it will throw an error. So, you shouldn't use it to check if a value is a sequence. Calling sequential? is just enough here.
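
A small illustration of the difference, using only clojure.core. The helper name seq-throws? is made up for this example.

```clojure
;; seq throws IllegalArgumentException on values that are not seqable,
;; whereas sequential? simply returns false for them.
(defn seq-throws?
  "Returns true if calling seq on x throws, false otherwise."
  [x]
  (try
    (seq x)
    false
    (catch IllegalArgumentException _ true)))

(seq-throws? 42)        ; seq cannot handle a number
(sequential? 42)        ; safe: a number is not sequential
(sequential? ["A" "T"]) ; vectors are sequential
(sequential? "AT")      ; strings are seqable but not sequential
```

Since every-pred applies its predicates in order, (every-pred seq sequential?) reaches seq first and throws before sequential? gets a chance to reject the value.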

(defn validate-variants
  "Find bad expression in varints and return sequence of the map that
  explains bad positions"
  ([validator variants] (seq (keep validator variants))))
Member

The seq here is not necessary.

(format "Invalid number of elements. Requires %s, but got %d."
        (str number) (count entries)))))))

(defn- check-each-samples [variant samples mformat]
Member

I'd name it validate-samples or something like that. Grammatically speaking, we don’t usually put a plural noun after "each".

(defn validate-variants
  "Find bad expression in varints and return sequence of the map that
  explains bad positions"
  ([validator variants] (seq (keep validator variants))))
Member

Users can’t tell which validation error came from which variant, so I think the validation target variant should always be assoced to the error map.
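
A possible shape for this, as a sketch only: it assumes, as in the snippet above, that the validator is a function returning an error map or nil, and the :variant key is a hypothetical choice.

```clojure
;; Hypothetical sketch: attach the offending variant to each error map.
(defn validate-variants
  "Returns a lazy seq of error maps, each carrying its source variant
  under :variant. Valid variants produce no entry."
  [validator variants]
  (keep (fn [variant]
          (some-> (validator variant)
                  (assoc :variant variant)))
        variants))
```

The result is an empty sequence rather than nil when every variant is valid, so callers that want a truthiness test can apply seq or empty? themselves.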

@athos athos force-pushed the feature/check-writing-valid-vcf branch 4 times, most recently from cfc5c11 to 3953f01 Compare August 29, 2023 01:20
@athos
Member

athos commented Aug 29, 2023

Just rewrote the validator implementation.

It should provide basic validation features, but I think it's a bit vulnerable to broken definitions of the Info and Genotype fields themselves, so we might want to work on that area as future work.

Member

@athos athos left a comment

Approved this, just for form's sake.

Member

@alumi alumi left a comment

@athos Thank you for organizing the API! 🎉
It appears to be consistently structured and more user-friendly than before.
The internal implementation also has a greater sense of uniformity and is well-structured.
I agree with this approach, so I approve the PR.

@matsutomo81
Contributor

Thank you for creating this PR and I apologize for the significant delay in responding to your review request 🙏
I haven't been able to review all of the code yet, but I commented on a few points I've noticed so far.

Contributor

@matsutomo81 matsutomo81 left a comment

I’ve completed the code review.
LGTM! 👍

@athos athos force-pushed the feature/check-writing-valid-vcf branch from 6a2d342 to 38f33f1 Compare October 4, 2023 04:39
@athos athos force-pushed the feature/check-writing-valid-vcf branch from 38f33f1 to def3c5b Compare October 4, 2023 04:45
@athos
Member

athos commented Oct 4, 2023

Thank you all for reviewing! I just squashed the commits.

@alumi alumi merged commit 2b263ef into master Oct 4, 2023
17 checks passed
@alumi alumi deleted the feature/check-writing-valid-vcf branch October 4, 2023 04:59