Metadata contains incorrect values for library_preparation_protocol.library_construction_method #13

NoopDog · 2021-04-14T16:55:05Z

The Library Construction Approach facet in the data browser has several terms that are either incorrectly labeled or insufficiently specific, cluttering up the list.

For example, for the 10x family of library construction approaches, the browser lists:

10X 3' v1 sequencing
10x 3' v2
10X 3' v2 sequencing
10x 3' v3 sequencing
10X 3' v3 sequencing
10X 5' v2 sequencing
10X Ig enrichment
10X TCR enrichment
10X v2 sequencing
10x v3 sequencing

See the 20210401_dcp4-Library-Preparation-Protocols Spreadsheet for a report with the full list of library preparation protocol documents used in the metadata.

The above list contains several classes of errors that should be fixed and may require changes to validation or ingest/wrangling SOP to prevent them from happening again.

Note that in addition to TDR snapshots and Azul indexes, the incorrect ontology terms are also likely in the DCP Generated Matrices' embedded metadata. We will need to validate the DCP generated matrices, and if necessary, come up with an efficient approach for updating the metadata.

Expected Outcome

Using the correct and most specific ontology terms available, we should be able to trim the above list to:

10X 3' v1 sequencing
10X 3' v2 sequencing
10x 3' v3 sequencing
10X 5' v2 sequencing
10X Ig enrichment
10X TCR enrichment

Note that since 10X Ig enrichment and 10X TCR enrichment are subclasses of 10X 5' v2 sequencing, we may be able to eliminate 10X 5' v2 sequencing as well.

Background

library_preparation_protocol.library_construction_method is defined to have a graph restriction: Subclasses of OBI:0000711 from obo:efo.

See EBI OLS EFO / OBI_0000711 for the ontology terms we use to define this field.

The value of library_preparation_protocol.library_construction_method is a library_construction_ontology entity, which defines the following fields:

Field	Description
library_construction_ontology.ontology	(string) An ontology term identifier in the form prefix:accession. For example, "EFO:0009310" or "EFO:0008931".
library_construction_ontology.ontology_label	(string) The preferred label for the ontology term referred to in the ontology field. This may differ from the user-supplied value in the text field. For example, "10X v2 sequencing" or "Smart-seq".
library_construction_ontology.text	(string) The name of a library construction approach being used. For example "10X v2 sequencing" or "Smart-seq2".

When Azul indexes this field, it uses ontology_label if present and text otherwise. If neither is present, it uses ontology (the term identifier).
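
The precedence described above can be sketched in a few lines of Python. This is only an illustration of the fallback order, not Azul's actual code: the function name display_value is hypothetical, and treating empty strings as absent is an assumption of the sketch.

```python
from typing import Optional


def display_value(method: dict) -> Optional[str]:
    """Pick the value shown for one library_construction_method entity.

    Order of preference: ontology_label, then text, then ontology.
    Empty strings are treated as absent (an assumption of this sketch).
    """
    for field in ("ontology_label", "text", "ontology"):
        value = method.get(field)
        if value:
            return value
    return None
```

With this order, an entity that carries a mislabeled ontology_label shows the wrong label even when its ontology identifier is correct, which is how error type 1 below reaches the facet.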

Error Types

Looking at the spreadsheet above, it appears there are several classes of problems to be addressed:

Type	Description	Example
1	Incorrect ontology_label	e.g. using DroNc-Seq instead of DroNc-seq, or 10x 3' v2 instead of 10X 3' v2 sequencing
2	Using a general ontology identifier when a more specific term is available	e.g. using 10X v2 sequencing (EFO:0009310) instead of a more specific term that specifies the end_bias, such as EFO:0009899
3	Mismatch of ontology_label and ontology_term	e.g. label is 10X 3' v2 sequencing and text is 10X 5' v2 sequencing (Row 66)

We may also have internal consistency errors that show up with further validation, for example, where the end_bias does not match the ontology term.
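
As a rough sketch of how the three error types in the table could be detected per entity, assuming a lookup from ontology identifier to preferred label is available. The TERMS excerpt below is illustrative only: the label given for EFO:0009899 and the too_general flag are assumptions, and a real check would load this information from EFO. The type 3 check follows the example in the table (label versus text) and is only a heuristic, since text is user-supplied.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    label: str
    too_general: bool = False  # True when a more specific term should be used


# Illustrative excerpt only; labels and flags here are assumptions.
TERMS: Dict[str, Term] = {
    "EFO:0009310": Term("10X v2 sequencing", too_general=True),
    "EFO:0009899": Term("10X 3' v2 sequencing"),
}


def find_errors(method: dict) -> List[int]:
    """Return the error types (1, 2, 3) found in one entity."""
    errors = []
    term = TERMS.get(method.get("ontology"))
    label = method.get("ontology_label")
    text = method.get("text")
    # Type 1: the stored label differs from the ontology's preferred label.
    if term is not None and label is not None and label != term.label:
        errors.append(1)
    # Type 2: the identifier points at a term that is too general.
    if term is not None and term.too_general:
        errors.append(2)
    # Type 3: label and text disagree beyond capitalization.
    if label and text and label.casefold() != text.casefold():
        errors.append(3)
    return errors
```

For example, an entity with ontology EFO:0009899 and both ontology_label and text set to "10x 3' v2" would be reported as type 1 only.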

Possible Discussion Points

What is the best way to find, report, track, and fix these kinds of errors and create a work queue for resolving them?
Where might we add validation to prevent incorrect ontology terms and labels?
What validations are required, and how might they be specified and implemented? For example:
1. How can we specify when non-leaf nodes should be disallowed as ontology terms? For example, how could we specify that 10x 5’ v2 sequencing is allowed, but 10x v2 sequencing is not?
2. What is the purpose of the text field when the ontology label is provided? Should we be concerned when there is an apparent mismatch between the ontology label and the text?
Should we more aggressively use HCAO to add terms where they are missing in the core ontologies? For example, to prevent "nulls" in the ontology and ontology text fields.
Can/should we fix the incorrect metadata that has made it into DCP generated matrices?
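
One possible way to specify the non-leaf rule from point 1 above is "leaf terms are allowed, plus an explicit list of non-leaf exceptions". The sketch below assumes a child-to-parent map; the PARENT excerpt and the ALLOWED_NON_LEAF set are hypothetical and only mirror the relationships mentioned in this issue, not the full EFO hierarchy.

```python
# Hypothetical child -> parent excerpt, using the labels from this issue.
PARENT = {
    "10X 3' v2 sequencing": "10X v2 sequencing",
    "10X 5' v2 sequencing": "10X v2 sequencing",
    "10X Ig enrichment": "10X 5' v2 sequencing",
    "10X TCR enrichment": "10X 5' v2 sequencing",
}

# Non-leaf terms that are still acceptable as values (an assumption).
ALLOWED_NON_LEAF = {"10X 5' v2 sequencing"}


def is_allowed(term: str) -> bool:
    """A term is allowed if it has no children or is explicitly exempted."""
    has_children = term in PARENT.values()
    return (not has_children) or term in ALLOWED_NON_LEAF
```

Under these assumptions, is_allowed("10X 5' v2 sequencing") is true while is_allowed("10X v2 sequencing") is false. The exception list is the part that would need to live in the schema or the validation configuration.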

Notes

The query for the above spreadsheet is listed below. The query could be modified to look for similar errors in other ontologized fields.

-- One row per (project, library preparation protocol), with the ontologized
-- library_construction_method fields and end_bias.
SELECT
  protocol_project.project_id,
  library_preparation_protocol_id,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.ontology") AS ontology_id,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.ontology_label") AS ontology_label,
  JSON_EXTRACT_SCALAR(content,
    "$.library_construction_method.text") AS text,
  JSON_EXTRACT_SCALAR(content,
    "$.end_bias") AS end_bias
FROM
  `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.library_preparation_protocol` AS library_preparation_protocol
FULL JOIN (
  -- Distinct (project, protocol) pairs taken from the links table.
  SELECT
    DISTINCT *
  FROM (
    SELECT
      project_id,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_type") AS protocol_type,
      JSON_EXTRACT_SCALAR(protocol,
        "$.protocol_id") AS protocol_id
    FROM
      `broad-datarepo-terra-prod-hca2.hca_prod_20201120_dcp2___20210401_dcp4.links`
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(content,
          "$.links")) AS process
    LEFT JOIN
      UNNEST(JSON_EXTRACT_ARRAY(process,
          "$.protocols")) AS protocol ) AS link_protocol
  -- Keep only library preparation protocols.
  WHERE
    protocol_type = "library_preparation_protocol") AS protocol_project
ON
  protocol_project.protocol_id = library_preparation_protocol.library_preparation_protocol_id
ORDER BY
  ontology_id,
  library_preparation_protocol_id
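
The ontology_label column returned by a query like this could be post-processed to spot labels that differ only in capitalization. A minimal sketch, assuming the labels have been exported as a plain list of strings (the function name case_variants is hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def case_variants(labels: Iterable[str]) -> Dict[str, List[str]]:
    """Group labels that are identical except for capitalization."""
    groups = defaultdict(set)
    for label in labels:
        groups[label.casefold()].add(label)
    # Keep only groups that contain more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }
```

Applied to the facet values listed at the top of this issue, this would group "10x 3' v3 sequencing" with "10X 3' v3 sequencing". It would not catch "10x 3' v2" versus "10X 3' v2 sequencing", which differ by more than case.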


mshadbolt · 2021-04-15T00:10:00Z

just wanted to note that 10X 5' v2 sequencing is specifically gene expression whereas the sub-terms are enrichment of that kind of library, so we can't remove that term.

hannes-ucsc · 2021-04-21T17:29:37Z

I thought the sub-term relation expressed an "is a" relation. If that's the case, we would need a more specific sub term.

If we want to represent an apple, and the fruit term only has pear and orange sub-terms, then we shouldn't use fruit but instead add apple as a sub-term of fruit and use that.

NoopDog · 2021-04-22T02:47:26Z

@theathorn @hannes-ucsc @mshadbolt I added possible discussion points above for when we gather to discuss.

theathorn · 2021-05-03T19:19:21Z

Slack thread.

ESapenaVentura · 2021-05-05T08:04:18Z

> I thought the sub-term relation expressed an "is a" relation. If that's the case, we would need a more specific sub term.
>
> If we want to represent an apple, and the fruit term only has pear and orange sub-terms, then we shouldn't use fruit but instead add apple as a sub-term of fruit and use that.

10x 5' (v2 and v3) represent the gene expression from 5' end of the whole set of transcripts of a cell. 5' Ig and TCR enrichment are taking a part of that whole set of transcripts and enriching for the sequences that are translated as T cell receptors and Immunoglobulins; these transcripts have sequences that can be identified and, therefore, can be enriched and separated.

Following the analogy, it would be more of a situation of representing an apple and its seeds. The seeds are a part of the apple, but you still need the apple term to represent the whole.

Happy to discuss if we need to change the way we label it to make it more clear, but I don't think the current way is incorrect

hannes-ucsc · 2021-05-05T16:40:26Z

I see, the old "is a" vs "part of" conflation. How do I as a consumer of the ontology distinguish between term relationships that represent inheritance (is a) vs ones that express an aggregation (part of)?

kbergin · 2021-05-06T15:37:55Z

Just a brief comment - in your main post, in the 'expected ontology' narrowed down list, these two are the same except capitalizations, so they can be condensed.

10x 3' v3 sequencing
10X 3' v3 sequencing

NoopDog · 2021-05-07T17:28:04Z

Notes from our May 6 call

We agreed that:

For problem 1 above - incorrect ontology label

Wranglers will update the metadata in an upcoming release.
Azul team will investigate resolving ontology labels from the ontology term ID during/prior to indexing.

For problem 2 above - non-leaf ontology terms used

Wranglers will update the non-leaf terms to the leaf term in an upcoming release.

For missing ontologies

Wranglers will backfill any null library_construction_ontology.ontology where an appropriate ontology term now exists.

To track the wrangler work

We will create tickets in the HCA EBI Wrangler Central GitHub repository.

To add traceability to these DCP-wide activities

Epics will be created in the DCP2 (this) repo and we will link the epics to the Wrangler Central issues.

TODO

We need to determine if and how we will update the DCP-generated matrices affected by the metadata updates.

NoopDog · 2021-05-25T18:30:21Z

Notes from May 25 Call

Azul to look up ontology terms at indexing time from ontology IDs. See "Use ontology id to lookup label" (DataBiosphere/azul#3076).
Azul to return ontology IDs along with term names in Azul responses so the data browser can display the term on hover and link out to the term definition. See "Add ontology id to entities response" (DataBiosphere/azul#3078).
TODO Confirm that library construction approaches are mutually exclusive, i.e. only one would appear wherever the term is used, and that the relationship between child terms and parent terms is "is a".
TODO Close tickets in wrangler central related to incorrect ontology labels, as these will be fixed in the browser by "Use ontology id to lookup label" (DataBiosphere/azul#3076).
TODO Schedule remaining wrangler central tickets to a data release (using an ontology where none existed, using the leaf term).
TODO Determine if EFO/OBI supports versioning, or what the recommended practice is to say we are using the ontology as it existed on a given date.
Convert this issue to a ZenHub epic and move blocking tickets to sub-tickets.

hannes-ucsc · 2021-05-26T18:18:55Z

Epics shouldn't be blocked by the individual tickets, otherwise the filtering by Epic doesn't work. Those tickets should be part of the epic. I'll fix this.

hannes-ucsc · 2022-04-20T17:45:50Z

There are no orange tickets in this epic. Taking off the orange label.

Many of the issues in this epic were closed in favor of a programmatic solution (epic #3079).

theathorn added the orange ([process] Done by the Azul team) and bug ([type] A defect preventing use of the system as specified) labels Apr 14, 2021

theathorn assigned NoopDog Apr 14, 2021

theathorn added the spike:3 [process] Spike estimate of three points label Apr 14, 2021

hannes-ucsc added a commit that referenced this issue Apr 21, 2021

Specify the production/consumption of ontologized properties (#13)

b0cbbc3

hannes-ucsc added a commit that referenced this issue Apr 21, 2021

Specify the production/consumption of ontologized properties (#13)

fd79ddc

hannes-ucsc added a commit that referenced this issue Apr 21, 2021

Specify the production/consumption of ontologized properties (#13)

7da96a3

hannes-ucsc added a commit that referenced this issue Apr 21, 2021

Specify the production/consumption of ontologized properties (#13)

bb019b9

theathorn changed the title ~~Metadatata contains incorrect values for library_preparation_protocol.library_construction_method~~ Metadata contains incorrect values for library_preparation_protocol.library_construction_method May 3, 2021

hannes-ucsc added a commit that referenced this issue May 11, 2021

Specify the production/consumption of ontologized properties (#13)

3ac9d67

mshadbolt mentioned this issue May 12, 2021

Ontology curation error fixes ebi-ait/hca-ebi-wrangler-central#322

Open

25 tasks

theathorn mentioned this issue May 25, 2021

Use ontology id to lookup label DataBiosphere/azul#3076

Open

NoopDog added the epic [type] Issue consists of multiple smaller issues label May 25, 2021

hannes-ucsc added a commit that referenced this issue Jun 16, 2021

Specify the production/consumption of ontologized properties (#13)

08574f8

hannes-ucsc added a commit that referenced this issue Jun 17, 2021

Specify the production/consumption of ontologized properties (#13)

4c51feb

hannes-ucsc added a commit that referenced this issue Aug 10, 2021

Specify the production/consumption of ontologized properties (#13)

9491dd6

hannes-ucsc added a commit that referenced this issue Sep 2, 2021

Specify the production/consumption of ontologized properties (#13)

d0841cc

hannes-ucsc added a commit that referenced this issue Oct 4, 2021

Specify the production/consumption of ontologized properties (#13)

88f9135

hannes-ucsc added a commit that referenced this issue Oct 7, 2021

Specify the production/consumption of ontologized properties (#13)

9b0ebe4

hannes-ucsc removed the orange [process] Done by the Azul team label Apr 20, 2022
