Metadata contains incorrect values for library_preparation_protocol.library_construction_method #13
Comments
just wanted to note that |
I thought the sub-term relation expressed an "is a" relation. If that's the case, we would need a more specific sub term. If we want to represent an apple, and the |
@theathorn @hannes-ucsc @mshadbolt I added possible discussion points above for when we gather to discuss. |
Slack thread. |
10x 5' (v2 and v3) represents gene expression from the 5' end of the whole set of transcripts of a cell. 5' Ig and TCR enrichment take a part of that whole set of transcripts and enrich for the sequences that are translated as T cell receptors and immunoglobulins; these transcripts have sequences that can be identified and can therefore be enriched and separated. Following the analogy, it is more like representing an apple and its seeds. The seeds are a part of the apple, but you still need the apple term to represent the whole. Happy to discuss whether we need to change the way we label it to make it clearer, but I don't think the current way is incorrect. |
I see, the old "is a" vs "part of" conflation. How do I as a consumer of the ontology distinguish between term relationships that represent inheritance (is a) vs ones that express an aggregation (part of)? |
Just a brief comment: in the 'expected ontology' narrowed-down list in your main post, these two are the same except for capitalization, so they can be condensed. 10x 3' v3 sequencing |
Notes from our May 6 call
We agreed that:
For problem 1 above - incorrect ontology label
For problem 2 above - non-leaf ontology terms used
For missing ontologies
To track the wrangler work
To add traceability to these DCP-wide activities
TODO
|
Notes from May 25 Call
|
Epics shouldn't be blocked by the individual tickets, otherwise the filtering by Epic doesn't work. Those tickets should be part of the epic. I'll fix this. |
There are no Many of the issues in this epic were closed in favor of a programmatic solution (epic #3079). |
The Library Construction Approach facet in the data browser has several terms that are either incorrectly labeled or insufficiently specific, cluttering up the list.
For example, for the 10x family of library construction approaches, the browser lists:
See the 20210401_dcp4-Library-Preparation-Protocols Spreadsheet for a report with the full list of library preparation protocol documents used in the metadata.
The above list contains several classes of errors that should be fixed and may require changes to validation or ingest/wrangling SOP to prevent them from happening again.
Expected Outcome
Using the correct and most specific ontology terms available, we should be able to trim the above list to:
Note that since 10X Ig enrichment and 10X TCR enrichment are subclasses of 10X 5' v2 sequencing, we may be able to eliminate 10X 5' v2 sequencing as well.
Background
library_preparation_protocol.library_construction_method is defined to have a graph restriction: Subclasses of OBI:0000711 from obo:efo.
See EBI OLS EFO / OBI_0000711 for the ontology terms we use to define this field.
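A graph restriction of this kind, together with the "most specific term" requirement, could be checked roughly as sketched below. This is only an illustration under stated assumptions: the hierarchy is a hand-written toy, every term ID except OBI:0000711 is hypothetical, and the helper names `is_subclass_of` and `is_leaf` are made up for this example. A real check would load the hierarchy from the ontology itself.

```python
# Toy hierarchy mapping child term -> parent term.
# All IDs except OBI:0000711 are hypothetical placeholders.
PARENT = {
    "EX:10X_5P_V2": "OBI:0000711",   # stands in for 10X 5' v2 sequencing
    "EX:10X_IG": "EX:10X_5P_V2",     # stands in for 10X Ig enrichment
    "EX:10X_TCR": "EX:10X_5P_V2",    # stands in for 10X TCR enrichment
    "EX:UNRELATED": "EX:OTHER_ROOT", # a term outside the restricted branch
}


def is_subclass_of(term, ancestor, parent=PARENT):
    """Return True if `term` is a strict descendant of `ancestor`."""
    seen = set()
    current = parent.get(term)
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guard against cycles in malformed data
        current = parent.get(current)
    return False


def is_leaf(term, parent=PARENT):
    """Return True if no term in the hierarchy lists `term` as its parent."""
    return term not in parent.values()
```

With this toy data, `is_subclass_of("EX:10X_IG", "OBI:0000711")` passes the restriction, while `is_leaf("EX:10X_5P_V2")` is false, which is the "non-leaf term used" situation. Note that the hierarchy alone does not say whether an edge means "is a" or "part of".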
The value of the library_preparation_protocol.library_construction_method is a library_construction_ontology entity which defines the following fields
When Azul indexes this field, it uses ontology_label if present, and text if not. If neither is present, it uses ontology (the term reference).
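The fallback order described above can be sketched as follows. This is not Azul's actual code: the function name `facet_value` is hypothetical, and treating empty strings and `None` as "not present" is an assumption.

```python
def facet_value(entity):
    """Pick the value to show for a library_construction_ontology entity.

    Preference order: ontology_label, then text, then ontology.
    Assumption: a missing key, None, or an empty string all count as absent.
    """
    for key in ("ontology_label", "text", "ontology"):
        value = entity.get(key)
        if value:
            return value
    return None
```

One consequence of this order is that a wrong ontology_label masks a correct ontology term reference, since the label wins whenever it is present.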
Error Types
Looking at the spreadsheet above, it appears there are several classes of problems to be addressed:
We may also have internal consistency errors that show up with further validation, for example, where the end_bias does not match the ontology term.
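One way such a consistency check might look is sketched below. It assumes that the ontology label carries a 3' or 5' marker and that end_bias values begin with "3 prime" or "5 prime"; both the function name and these conventions are assumptions for illustration, not a description of existing validation.

```python
import re


def end_bias_matches_label(end_bias, ontology_label):
    """Compare the end implied by the label with the recorded end_bias.

    Returns True or False when the label contains a 3' or 5' marker,
    and None when the label gives no hint (nothing to check).
    """
    match = re.search(r"([35])'", ontology_label)
    if match is None:
        return None
    # Assumed convention: end_bias values start with "3 prime" or "5 prime".
    return end_bias.startswith(f"{match.group(1)} prime")
```

For example, `end_bias_matches_label("3 prime tag", "10X 5' v2 sequencing")` returns `False`, flagging the record for review.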
Possible Discussion Points
What is the best way to find, report, track, and fix these kinds of errors and create a work queue for resolving them?
Where might we add validation to prevent incorrect ontology terms and labels?
What validations are required, and how might they be specified and implemented? For example:
Should we use hcao more aggressively to add terms where they are missing in the core ontologies? For example, to prevent "nulls" in the ontology and ontology text fields.
Can/should we fix the incorrect metadata that has made it into DCP-generated matrices?
Notes
The query for the above spreadsheet is listed below. The query could be modified to look for similar errors in other ontologized fields.