Develop and implement HTAN/CDS seq template #396

aclayton555 · 2024-05-02T18:30:25Z

Historically, we have used the "Other Assay" template as a catch-all for data types for which we do not yet have an assay-specific RFC and component in our data model. This has allowed contributors to proceed with data submission and annotation under "Other Assay." Later, once the RFC has been completed and the assay-specific component has been implemented in the data model, the HTAN DCC has re-engaged with contributors to update their annotations from the "Other Assay" template to the respective assay-specific template.

As we approach the end of HTAN 1.0, we have data types that need to be submitted for which we currently do not have templates in place (e.g. bulk ATACseq). We could have these submitted using the "Other Assay" template; however, we are trying to move away from this template as far as possible through the end of HTAN 1.0 and to have as much data as possible annotated according to an assay-specific template. Furthermore, the "Other Assay" template does not provide sufficient information to allow for mapping and transfer to CDS.

This ticket proposes developing a data level-agnostic, minimal sequencing data template as a catch-all for the remaining expected sequencing data. This template would capture relevant metadata per the existing HTAN data model AND be readily mappable to the existing CDS seq metadata template, to enable transfer of data submitted under this template to CDS (through the remainder of HTAN 1.0).

Re: level-agnostic, the approach here is to create a low-lift template for contributors to complete for their various file types. Once files are received, the HTAN DCC will determine how they should be organized according to existing levels and tiers of controlled access (e.g. if a fastq is submitted, assume L1 and controlled access).
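
As a rough illustration of the post-submission triage described above, the sketch below assigns a level and access tier from a file name. Only the fastq rule (L1, controlled access) comes from this ticket; the function name `triage_file`, the list of fastq suffixes, and the fallback to manual review are assumptions.

```python
from typing import Optional, Tuple

# Only the fastq rule is stated in this ticket. The suffix spellings below
# are assumed; rules for other file types would be added once agreed.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz", ".fq", ".fq.gz")


def triage_file(filename: str) -> Optional[Tuple[str, str]]:
    """Return (level, access tier) for a submitted file, or None if undecided."""
    name = filename.lower()
    if name.endswith(FASTQ_SUFFIXES):
        return ("L1", "controlled")
    return None  # no rule yet: leave for manual review by the DCC


if __name__ == "__main__":
    for f in ["sample_R1.fastq.gz", "peaks.bed"]:
        print(f, "->", triage_file(f))
```

Returning `None` rather than guessing keeps unknown file types visible for review instead of silently assigning them a level.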

aditigopalan · 2024-05-02T19:03:26Z

Keep it level-agnostic for contributors to submit; we will break it down by level after submission.
Clarisse has a mapping file (a YAML file in the data-release-tracker repo in ncihtan) that maps from the HTAN model to the CDS model; this is a starting point for a template.
The template could be in a sheet-based format to begin with.

aclayton555 · 2024-05-07T17:13:37Z

Discussed on 2024.05.07 HTAN DCC Ops call:

Consider calling this "Other Sequencing Assay"
Build in attributes from the existing "Other Assay" template that provide assay descriptors for the portal. CDS does not currently distinguish single-cell vs. bulk, so we can account for this in the template.
Overall, group supportive of this approach

@aditigopalan and @clarisse-lau can you please work together on this? Please let me know if it would be helpful to set up a time to work on this further. THANK YOU!

clarisse-lau · 2024-05-08T00:10:12Z

Thank you @aclayton555 and @aditigopalan! (small correction: the mapping file is in the ncihtan/cds_dbgap repo)
I think it would be helpful to set up a call to talk through things. I have a couple of pending meetings, but this is my current availability this week:

Wednesday 10:30-11:30am, 12:30-1:30 PT
Thursday 9am or 12pm PT
Friday 8am, 10am-12pm PT

aditigopalan · 2024-05-09T12:14:52Z

Sorry, I missed this! Are you still available 10:30am PT tomorrow? @clarisse-lau

clarisse-lau · 2024-05-09T13:13:48Z

No worries! 10:30 tomorrow works

aditigopalan · 2024-05-09T13:16:28Z

Just sent you an invite!

clarisse-lau · 2024-05-10T20:15:31Z

A thought on this... Component is one of the source attributes used in the CDS mapping file (to map library_strategy and library_source values; see the CDS template).

As we cannot have Component twice in a template, we could instead include the Data Type attribute, which should provide sufficient information to map to the above fields.
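
To make the idea concrete, here is a minimal sketch of a lookup keyed on Data Type instead of Component. The Data Type keys and the mapped values are illustrative placeholders and are not copied from the actual mapping file; `map_data_type` is a hypothetical helper name.

```python
from typing import Dict, Optional

# Hypothetical lookup table: placeholder entries for illustration only,
# NOT taken from the real HTAN-to-CDS mapping file.
DATA_TYPE_TO_CDS: Dict[str, Dict[str, str]] = {
    "Bulk ATAC-seq": {
        "library_strategy": "ATAC-seq",
        "library_source": "GENOMIC",
    },
    "Bulk RNA-seq": {
        "library_strategy": "RNA-Seq",
        "library_source": "TRANSCRIPTOMIC",
    },
}


def map_data_type(data_type: str) -> Optional[Dict[str, str]]:
    """Return the CDS fields for a Data Type value, or None if unmapped."""
    return DATA_TYPE_TO_CDS.get(data_type)


if __name__ == "__main__":
    print(map_data_type("Bulk ATAC-seq"))
    print(map_data_type("Something else"))
```

An unmapped Data Type returns `None`, so gaps in the mapping surface during manifest preparation rather than producing a wrong CDS value.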

aditigopalan · 2024-05-15T13:24:26Z

Here is the template for now. I replaced one of the Component columns with "Data Type".

output.csv

Also here are the attributes: "Last Known Disease Status, Primary Diagnosis, Fixative Type, Treatment Outcome, SizeX, age_at_diagnosis_years, Genomic Reference, Race, NominalMagnification, Days to Recurrence, Morphology, Filename, SizeY, pi_last, Library Selection Method, Days to Last Known Disease Status, file_url_in_cds, pi_first, HTAN_Center, Biospecimen Type, Sequencing Platform, Tseries, Ethnicity, Tissue or Organ of Origin, File Format, SizeZ, PhysicalSizeY, channel_metadata_url, Microscope, Days to Last Follow up, HTAN Data File ID, LensNA, PhysicalSizeX, HTAN Participant ID, Library Layout, Vital Status, Software and Version, Treatment Type, Tumor Tissue Type, Objective, Pyramid, HTAN Biospecimen ID, SizeT, SizeC, Protocol Link, pi_email, Imaging Assay Type, Component, cancer_type, Site of Resection or Biopsy, WorkingDistance, md5, Immersion, Gender, File_Size, Zstack, Progression or Recurrence, Tumor Grade"

Should we rearrange the fields for clarity? Would it also help to have a definition of some fields (e.g. SizeZ), or would the users be familiar with these names?

@clarisse-lau let me know what you think!

clarisse-lau · 2024-05-15T14:07:55Z

Thank you @aditigopalan !

As this template is intended to be specific to sequencing data, we can remove imaging-related attributes. The genomics mapping goes up to Row 940 of the mapping file and is followed by mappings for various imaging metadata tables, which don't need to be pulled into this CDS template.
The attributes included in the CDS template can be subset to only those found in the HTAN Data Model. A few 'source' attributes (e.g. HTAN_Center, file_size, file_url_in_cds) are not actually part of the HTAN data model; they are added at various points in the data flow process, either to BigQuery tables or by the CDS manifest preparation script. Sorry for the confusion.

These changes should simplify the CDS template quite a bit, and as it would only include existing data model elements, users will have access to definitions for each field from the HTAN data model.

Some rearranging to align with HTAN template conventions can be done at the implementation stage, i.e. using DependsOn in the data model; keeping it as an unordered CSV for now is fine. Typically HTAN manifests start with the following attributes in this order: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, followed by all other metadata attributes.
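
For the draft CSV, the column convention described above could be applied with something like the following sketch. The leading attribute order is taken from this comment; the function name `order_columns` is hypothetical, and in the data model itself the ordering would come from DependsOn rather than from a script like this.

```python
from typing import Iterable, List

# Leading columns, in the order HTAN manifests typically use.
LEADING = [
    "Component",
    "Filename",
    "File Format",
    "HTAN Parent Biospecimen ID",
    "HTAN Data File ID",
]


def order_columns(attributes: Iterable[str]) -> List[str]:
    """Put the conventional leading columns first, then everything else."""
    attrs = list(dict.fromkeys(attributes))  # drop duplicates, keep first-seen order
    head = [a for a in LEADING if a in attrs]
    tail = [a for a in attrs if a not in LEADING]
    return head + tail


if __name__ == "__main__":
    print(order_columns(["Data Type", "File Format", "Component", "Filename"]))
```

Attributes outside the leading set keep their original relative order, so the script only moves what the convention requires.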

clarisse-lau · 2024-05-15T16:39:25Z

Just had a chat with Ashley & Adam. We'd like to subset the attribute list even further to include only sequencing attributes (plus the descriptor columns: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, Data Type).

Clinical/biospecimen fields will be annotated separately by the center and pulled in from those templates respectively (as is currently done in the metadata generation scripts).

aclayton555 · 2024-05-24T17:10:27Z

@aditigopalan just checking in on this and whether there is anything you need the team to review at this stage. We are aiming to have this implemented and available for the Stanford center to test with the close-out of our 24-5 sprint.

aditigopalan · 2024-05-28T14:13:34Z

Thanks for checking in! Please let me know if this needs to be subset further @aclayton555 @adamjtaylor
output.csv

adamjtaylor · 2024-05-28T14:20:02Z

Thanks @aditigopalan, I think we only need those attributes that actually come from the sequencing technology, as the others will come from our Biospecimen and Clinical elements. So let's drop those and keep:

Genomic Reference
Library Layout
Data Type
Sequencing Platform
Library Selection Method

Plus the minimal HTAN columns for a component:

HTAN Data File ID
HTAN Parent Biospecimen ID
Filename
File Format
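
As a quick way to eyeball the reduced attribute set, the sketch below writes it out as a one-line CSV header. The nine attribute names are the ones listed in this comment; the function name `template_header` and the column order are assumptions, and the actual template is generated from the data model rather than by hand like this.

```python
import csv
import io

# Attributes that come from the sequencing technology.
SEQUENCING = [
    "Genomic Reference",
    "Library Layout",
    "Data Type",
    "Sequencing Platform",
    "Library Selection Method",
]

# Minimal HTAN columns for a component, as listed in this comment.
MINIMAL = [
    "HTAN Data File ID",
    "HTAN Parent Biospecimen ID",
    "Filename",
    "File Format",
]


def template_header() -> str:
    """Return the draft template header as a single CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(MINIMAL + SEQUENCING)
    return buf.getvalue()


if __name__ == "__main__":
    print(template_header(), end="")
```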

adamjtaylor · 2024-05-28T14:20:35Z

@aditigopalan if you can open a draft PR and link to this issue that would be useful. Thank you!

Fixes #396

aclayton555 · 2024-05-30T19:01:47Z

Add "CDS" prefix to all attributes for this template

adamjtaylor · 2024-06-03T19:57:22Z

Merged! @aditigopalan if you could generate the template using schematic or staging DCA and report if it looks sensible that would be great.

aditigopalan · 2024-06-03T20:14:12Z

@adamjtaylor tested using dca-staging! Looks alright to me.

aclayton555 · 2024-06-03T20:40:42Z

AMAZING (and radical) COLLABORATION ON THIS!

adamjtaylor · 2024-06-03T20:42:17Z

Not quite out of the woods yet! @aditigopalan is chasing down an errant loop in the DAG that is dragging some extra attributes into the template.
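
For anyone chasing a similar problem, a generic way to spot a loop in a dependency graph is a depth-first search that tracks the nodes on the current path. This is only a sketch: it assumes the dependency relationships have been loaded into a plain dict of attribute name to list of dependencies, and the node names in the example are hypothetical. It is not a description of the actual bug or of how schematic resolves dependencies.

```python
from typing import Dict, List, Optional


def find_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of nodes (start node repeated at the end), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in depends_on.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # back edge: nxt is already on the current path
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in depends_on:
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


if __name__ == "__main__":
    # Hypothetical graph for illustration only.
    graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": []}
    print(find_cycle(graph))
```

The recursive form is fine for a data model of modest size; a very deep graph would need an iterative version to avoid Python's recursion limit.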

aclayton555 assigned aditigopalan May 2, 2024

aditigopalan added a commit that referenced this issue May 28, 2024

Generating CDS template

ac8cc57

Fixes #396

aditigopalan mentioned this issue May 28, 2024

Generating CDS template #409

Merged

aclayton555 mentioned this issue May 28, 2024

HTAN 1.0 Data Model: sunset planning #389

Closed

aclayton555 added the critical label May 30, 2024

adamjtaylor closed this as completed in #409 Jun 3, 2024

aclayton555 mentioned this issue Jun 26, 2024

Errant fields in CDS Sequencing template #434

Open
