-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validator for writing vcf. #277
Conversation
a6160af
to
131e0ad
Compare
131e0ad
to
7114690
Compare
Codecov Report
@@ Coverage Diff @@
## master #277 +/- ##
==========================================
+ Coverage 88.85% 88.86% +0.01%
==========================================
Files 78 79 +1
Lines 6512 6771 +259
Branches 458 475 +17
==========================================
+ Hits 5786 6017 +231
- Misses 268 279 +11
- Partials 458 475 +17
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for implementing this feature! 👍 I think the basic functionality is good.
I have added a few comments as there seems to be room for improvement.
31226a9
to
79954bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating. It's getting much better! 👍
08526be
to
ded6248
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your continuous efforts in improving this! 🙏 I added some more comments.
Also, some of my previous comments were overlapped by one of them and may have been missed, so please take them into consideration again.
0f0c36e
to
533f73e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for drafting this PR and sorry for the terribly late reaction to it!
Personally, I'm thinking that the VCF validator namespace should provide functions like the following as a public API, from the aspects of consistency and extensibility:
(make-validator <meta-info> <header> <opts>)
(validate-variant <validator> <variant>)
(validate-variants <validator> <variants>)
Consistency
invalid-variant?
(btw I'd prefer valid-variant?
, but anyway) does more than returning a boolean value, so it shouldn’t be defined as a predicate. Rather, I think It would be nice to align its semantics with validate-variants
’s and provide a functionality that validates a single variant and reports errors if any.
Extensibility
If we add more validation options in the future, it would be tedious to pass those options to every call to validate-variant(s)
(especially when they are used in more than one place). So, I think the validation options should be passed to make-validator
, and validate-variant(s)
should use the options specified for the validator.
Also, validate-variants
could be extended to return a transducer (from the arity without variants), but it already has an optional argument vcf-or-bcf
, which makes it a little bit harder to add another arity for the transducer. In this respect, I think vcf-or-bcf
should be passed to make-validator
as another option, such as :file-type
.
We are highly likely to want to add more options for VCF validation in the future, and a validator will be a good place to put together various information including options.
JFYI : If we really need to have a function that returns a map containing error information, I’d prefer adding another "verb" for the different semantics. For example:
|
19cfea0
to
442d467
Compare
I renamed the api and change |
Thanks for the updates! And sorry that I carelessly suggested using the word "check" without paying attention to the code you already wrote, but currently the word "check" is used interchangeably with the word "validate" (like |
src/cljam/io/vcf/util/validator.clj
Outdated
res (if fmt | ||
(check-entry entry fmt | ||
(count (:alt variant))) | ||
[(format "Key %s not in meta." id)])] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meta-information lines are optional, but if they are present then they must be completely well-formed.
(snip) Note that BCF, the binary counterpart of VCF, requires that all entries are present.
It is recommended to include meta-information lines describing the entries used in the body of the VCF file.
A straightforward interpretation of these lines, I think, is that the validator should not raise an error only because the definition of a format key was not found in the meta info when it validates a VCF file.
src/cljam/io/vcf/util/validator.clj
Outdated
[(format (str "Key %s is not contained in ##info fields in " | ||
"the header.") id)]))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
src/cljam/io/vcf/util/validator.clj
Outdated
|
||
(defn- check-entry-type [entry type-str] | ||
((case type-str | ||
"Integer" integer? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The VCF spec defines integers to be 32-bit and signed, and disallows the values from
src/cljam/io/vcf/util/validator.clj
Outdated
(:chr variant))]], | ||
:pos [[integer? "Position must be Integer."]], | ||
:ref [[valid-ref? "Must consist of ACGTNacgtn."]] | ||
:alt [[(every-pred seq sequential?) "Must be a sequence."] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If seq
is called on a non-seqable value, it will throw an error. So, you shouldn't use it to check if a value is a sequence. Calling sequential?
is just enough here.
src/cljam/io/vcf/util/validator.clj
Outdated
(defn validate-variants | ||
"Find bad expression in varints and return sequence of the map that | ||
explains bad positions" | ||
([validator variants] (seq (keep validator variants)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seq
here is not necessary.
src/cljam/io/vcf/util/validator.clj
Outdated
(format "Invalid number of elements. Requires %s, but got %d." | ||
(str number) (count entries))))))) | ||
|
||
(defn- check-each-samples [variant samples mformat] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd name it validate-samples
or something like that. Grammatically speaking, we don’t usually put a plural noun after "each".
src/cljam/io/vcf/util/validator.clj
Outdated
(defn validate-variants | ||
"Find bad expression in varints and return sequence of the map that | ||
explains bad positions" | ||
([validator variants] (seq (keep validator variants)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Users can’t tell which validation error came from which variant, so I think the validation target variant should always be assoc
ed to the error map.
cfc5c11
to
3953f01
Compare
Just rewrote the validator implementation. It should provide basic validation features, but I think it's a bit vulnerable to broken definitions of Info and Genotype fields themselves, so we might want to work on that area as a future work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved this, just for form's sake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@athos Thank you for organizing the API! 🎉
It appears to be consistently structured and more user-friendly than before.
The internal implementation also has a greater sense of uniformity and is well-structured.
I agree with this approach, so I approve the PR
Thank you for creating this PR and I apologize for the significant delay in responding to your review request 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’ve completed the code review.
LGTM! 👍
6a2d342
to
38f33f1
Compare
38f33f1
to
def3c5b
Compare
Thank you all for reviewing! I just squashed the commits. |
I add VCF4.4 validator for writing.
see also #271