Investigate odd csv -> json-ld behaviour #368

Open
aclayton555 opened this issue Apr 2, 2024 · 12 comments
@aclayton555
Contributor

Ticket for us to look into potential causes of a couple of issues that DCC members have seen, where the case of certain valid values (e.g. "Yes" vs "yes") is throwing errors.

Recent example: https://sagebionetworks.jira.com/browse/HTAN-402

Alex also mentioned that he encountered issues with this when recently interacting with the Publications schema.

@aclayton555
Contributor Author

The scope of this is to do a little digging to see if there is something going on that we can/need to fix. Upon investigation, decide whether further escalation to FAIR Data is needed.

@adamjtaylor
Contributor

The linked ticket starts with an issue with Skin of lower limb and hip

In HTAN.model.jsonld the displayName for Skinoflowerlimbandhip is "skin of lower limb and hip" (lowercase s). Not sure if that could be the cause.

Skin of lower limb and hip (case sensitive) is a valid value for 5 attributes in the data model: [Progression or Recurrence Anatomic Site, Treatment Anatomic Site, Site of Resection or Biopsy, Tissue or Organ of Origin, Additional Topography]
skin of lower limb and hip (case sensitive) is a valid value for 1 attribute in the data model: [Melanoma Biopsy Resection Sites]

My guess is that this is leading to a namespace clash on conversion to the JSON-LD.
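To make the suspected clash concrete, here is a minimal Python sketch. The `to_label` rule (drop whitespace, upper-case the first character) is an assumption chosen only because it reproduces the `Skinoflowerlimbandhip` id quoted above; it is not taken from the converter's code, and `find_label_clashes` is a hypothetical helper.

```python
from collections import defaultdict


def to_label(display_name: str) -> str:
    """Hypothetical label rule (an assumption, not the converter's actual code):
    drop all whitespace and upper-case the first character."""
    compact = "".join(display_name.split())
    return compact[:1].upper() + compact[1:]


def find_label_clashes(display_names):
    """Group display names by derived label and keep labels shared by
    more than one distinct display name."""
    by_label = defaultdict(set)
    for name in display_names:
        by_label[to_label(name)].add(name)
    return {
        label: sorted(names)
        for label, names in by_label.items()
        if len(names) > 1
    }


clashes = find_label_clashes([
    "Skin of lower limb and hip",
    "skin of lower limb and hip",
    "Skin of sole",
])
print(clashes)
```

Under that assumed rule both spellings map to one label, so only one display name could be attached to the resulting node.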

Next step is to investigate whether skin of lower limb and hip has already been used for Melanoma Biopsy Resection Sites in existing or released metadata.

@adamjtaylor
Contributor

adamjtaylor commented Apr 10, 2024

This essentially becomes a clone/offshoot of our investigation into title/lower-case near duplicates and actions needed in backlog for #176

@adamjtaylor
Contributor

adamjtaylor commented Apr 10, 2024

Note that all the Melanoma Biopsy Resection Sites valid values (apart from Not Reported) start with a lowercase letter and are likely causing clashes with other attributes

Melanoma Biopsy Resection Sites,Biopsy resection sites specific to melanoma (not covered in Tiers 1 and 2),"skin of scalp, skin of eye lid, skin of nose, skin of lip, skin of ear, skin of neck, skin of other parts of face, skin of chest, skin of back, skin of abdomen, skin of trunk-other, skin of breast, skin of upper limb and shoulder, skin of palm, skin of lower limb and hip, skin of sole, skin of penis, skin of scrotum, skin of vulva, skin other, skin NOS, Not Reported",,,FALSE,Melanoma Tier 3,,,
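A quick way to list such case-only near duplicates across attributes is sketched below. It assumes the data model CSV has `Attribute` and `Valid Values` columns with comma-separated values, which matches the row above; the toy rows and the `case_variants` helper are illustrative, not part of the real model or tooling.

```python
import csv
import io
from collections import defaultdict

# Toy data model in the assumed CSV layout; only the columns used here.
MODEL_CSV = '''Attribute,Description,Valid Values
Site of Resection or Biopsy,Toy row,"Skin of lower limb and hip, Skin of scalp"
Melanoma Biopsy Resection Sites,Toy row,"skin of lower limb and hip, skin other, Not Reported"
'''


def case_variants(model_csv: str):
    """Map each lower-cased valid value to {spelling: [attributes]},
    keeping only values that appear with more than one spelling."""
    seen = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(model_csv)):
        for value in row["Valid Values"].split(","):
            value = value.strip()
            if value:
                seen[value.lower()][value].append(row["Attribute"])
    return {
        key: dict(spellings)
        for key, spellings in seen.items()
        if len(spellings) > 1
    }


variants = case_variants(MODEL_CSV)
for key, spellings in variants.items():
    print(key, spellings)
```

Run against the full model CSV, this would give a worklist of values to harmonise, along with the attributes that use each spelling.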

@adamjtaylor
Contributor

Melanoma Biopsy Resection Sites is part of the MelanomaTier3 component

We count the number of entries per distinct value for this attribute in Google BigQuery:

SELECT
  Melanoma_Biopsy_Resection_Sites,
  COUNT(*) AS n
FROM `htan-dcc.combined_assays.MelanomaTier3`
GROUP BY Melanoma_Biopsy_Resection_Sites
ORDER BY n DESC

| Melanoma_Biopsy_Resection_Sites | n |
| --- | --- |
| Not Reported | 17 |
| Skin of upper limb and shoulder | 13 |
| Skin of back | 9 |
| Skin NOS | 8 |
| Skin of lower limb and hip | 6 |
| Skin of scalp | 3 |
| Skin of abdomen | 2 |
| Skin of sole | 1 |
| Skin of vulva | 1 |
| Skin of ear | 1 |
| skin other | 1 |
| Skin of chest | 1 |

Apart from skin other, these all start with an uppercase letter, suggesting that the lowercase valid values in the data model have not been followed (maybe they were not implemented at the time?).

The only weird thing is that skin other is lowercase. Need to confirm how this appears in the data model.
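The comparison between submitted and valid values can be sketched in Python as follows. The two lists are copied from the data model row and the query result in this thread; the three-way split (exact match, case-only mismatch, unknown) is just one illustrative way to slice it.

```python
# Valid values for Melanoma Biopsy Resection Sites, as listed in the data model row.
valid = {
    v.strip()
    for v in (
        "skin of scalp, skin of eye lid, skin of nose, skin of lip, skin of ear, "
        "skin of neck, skin of other parts of face, skin of chest, skin of back, "
        "skin of abdomen, skin of trunk-other, skin of breast, "
        "skin of upper limb and shoulder, skin of palm, skin of lower limb and hip, "
        "skin of sole, skin of penis, skin of scrotum, skin of vulva, skin other, "
        "skin NOS, Not Reported"
    ).split(",")
}

# Distinct submitted values, as returned by the query.
submitted = [
    "Not Reported", "Skin of upper limb and shoulder", "Skin of back", "Skin NOS",
    "Skin of lower limb and hip", "Skin of scalp", "Skin of abdomen", "Skin of sole",
    "Skin of vulva", "Skin of ear", "skin other", "Skin of chest",
]

valid_lower = {v.lower() for v in valid}

# Case-sensitive matches against the data model.
exact = [s for s in submitted if s in valid]
# Values that would match only if case were ignored.
case_only = [s for s in submitted if s not in valid and s.lower() in valid_lower]
# Values with no match even when ignoring case.
unknown = [s for s in submitted if s.lower() not in valid_lower]

print("exact:", exact)
print("case-only mismatch:", case_only)
print("unknown:", unknown)
```

Only Not Reported and skin other match case-sensitively; the other ten submitted values differ from the data model only by case.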

@adamjtaylor
Contributor

The only occurrence of skin other (case insensitive) in the data model is for Melanoma Biopsy Resection Sites

@adamjtaylor
Contributor

All the other actual values submitted for Melanoma Biopsy Resection Sites appear either in Site of Resection or Biopsy (eg Skin of lower limb and hip) or Additional Topography (eg Skin of sole)

Note Additional Topography appears to be only used in the SRRS Biospecimen component - so just for the SRRS TNP and not for general HTAN center usage

@adamjtaylor
Contributor

adamjtaylor commented Apr 10, 2024

Next will look at Yes vs yes

@adamjtaylor
Contributor

In the CSV, lowercase "yes" is a valid value for "Treatment or Therapy" only, whereas title-case "Yes" is used more frequently.

In the data model JSON-LD, we see that title-case Yes is used as the valid value:

            "@id": "bts:TreatmentorTherapy",
            "@type": "rdfs:Class",
            "rdfs:comment": "A yes/no/unknown/not applicable indicator related to the administration of therapeutic agents received.",
            "rdfs:label": "TreatmentorTherapy",
            "rdfs:subClassOf": [
                {
                    "@id": "bts:Therapy"
                }
            ],
            "schema:isPartOf": {
                "@id": "http://schema.biothings.io"
            },
            "schema:rangeIncludes": [
                {
                    "@id": "bts:Yes"
                },
                {
                    "@id": "bts:No"
                },
                {
                    "@id": "bts:Unknown"
                },
                {
                    "@id": "bts:NotReported"
                }
            ],
            "sms:displayName": "Treatment or Therapy",
            "sms:required": "sms:false",
            "sms:validationRules": []
        },
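
For reference, a small Python sketch that pulls the allowed ids out of a class like this and checks them case-sensitively. The JSON is a trimmed copy of the node above, and `range_labels` is a hypothetical helper. It only inspects the ids and says nothing about how the converter maps display names onto them.

```python
import json

# Trimmed copy of the TreatmentorTherapy node, keeping only the fields used here.
CLASS_JSON = """
{
    "@id": "bts:TreatmentorTherapy",
    "schema:rangeIncludes": [
        {"@id": "bts:Yes"},
        {"@id": "bts:No"},
        {"@id": "bts:Unknown"},
        {"@id": "bts:NotReported"}
    ],
    "sms:displayName": "Treatment or Therapy"
}
"""


def range_labels(node: dict) -> list:
    """Return the ids listed under schema:rangeIncludes with the prefix removed."""
    return [
        ref["@id"].split(":", 1)[1]
        for ref in node.get("schema:rangeIncludes", [])
    ]


node = json.loads(CLASS_JSON)
labels = range_labels(node)
print(labels)
print("Yes" in labels, "yes" in labels)
```

The lowercase spelling from the CSV has no counterpart among the ids, which is consistent with only one case surviving the conversion.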

@adamjtaylor
Contributor

My hypothesis is that where there are case differences, the JSON-LD converter is now harmonising on the title-case version. I wonder if in the past it took both, or harmonised on the lower-case version.

Action for next sprint: escalate to FAIR. Suggest @aditigopalan work with them to confirm this hypothesis or understand how case is handled for the JSON-LD.

@adamjtaylor
Contributor

Looking back to the Aug 2023 data model release, I don't see a change in behavior

@aclayton555
Contributor Author

This is a problem we will need to engage with FAIR Data on in the future, to figure out how to clean this up based on the latest expected behavior of schematic. Push this back to the backlog and mark for renewal.
