Develop and implement HTAN/CDS seq template #396

Closed
aclayton555 opened this issue May 2, 2024 · 19 comments · Fixed by #409

@aclayton555
Contributor

aclayton555 commented May 2, 2024

Historically, we have leveraged the "Other Assay" template as a catch-all for data types for which we do not yet have an assay-specific RFC and component in our data model. This has allowed contributors to proceed with data submission and annotation under "Other Assay." Later, once the RFC has been completed and the assay-specific component has been implemented in the data model, the HTAN DCC has re-engaged with contributors to update their annotations from the "Other Assay" template to the respective assay-specific template.

As we approach the end of HTAN 1.0, we have data types that need to be submitted for which we currently do not have templates in place (e.g. bulk ATACseq). We could have these submitted using the "Other Assay" template; however, we are trying to move away from this template to the extent possible through the end of HTAN 1.0 and have as much data as possible annotated according to assay-specific templates. Furthermore, the "Other Assay" template does not provide sufficient information to allow for mapping and transfer to CDS.

This ticket proposes developing a data level-agnostic, minimal sequencing data template as a catch-all for the remaining expected sequencing data. This template would capture relevant metadata per the existing HTAN data model AND be readily mapped to the existing CDS seq metadata template, enabling transfer of data submitted under this template to CDS (through the remainder of HTAN 1.0).

Re: level-agnostic, the approach here is to create a low-lift template for contributors to complete for their various file types. Once received by the HTAN DCC, the DCC will determine how submitted files should be organized according to existing levels and tiers of controlled access (i.e. if a fastq is submitted, assume L1 and controlled access).
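A minimal sketch of that post-submission triage, for illustration only. The only rule taken from this thread is "fastq implies L1 and controlled access"; the `.fq` alias, the handling of compression suffixes, the `assign_level` name, and the "needs review" fallback are assumptions, not DCC policy.

```python
from pathlib import PurePosixPath

# Only the fastq rule comes from the thread; the .fq alias is an assumption.
LEVEL_RULES = {
    ".fastq": ("Level 1", "controlled"),
    ".fq": ("Level 1", "controlled"),
}

# Compression suffixes to strip before looking at the extension (assumption).
COMPRESSION_SUFFIXES = (".gz", ".bz2")


def assign_level(filename):
    """Return (level, access) for a submitted file, or a review flag if unknown."""
    name = filename.lower()
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    extension = PurePosixPath(name).suffix
    # Anything without an explicit rule is left for a curator to decide.
    return LEVEL_RULES.get(extension, ("needs review", "needs review"))
```

Files without a matching rule are flagged rather than guessed, so the DCC still makes the final call on levels and access tiers.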

@aditigopalan
Contributor

  • keep it level agnostic for contributors to submit, then we will break it down after submission
  • Clarisse has a mapping file (yaml file in data-release-tracker repo in ncihtan) that maps from the HTAN model to the CDS model (starting point for a template)
  • template could be in sheet based format to begin with

@aclayton555
Contributor Author

Discussed on 2024.05.07 HTAN DCC Ops call:

  • Consider calling this "Other Sequencing Assay"
  • Build in attributes from the existing "Other Assay" template that provide assay descriptors for the portal. CDS does not currently distinguish single-cell vs bulk, so we can account for this in the template.
  • Overall, group supportive of this approach

@aditigopalan and @clarisse-lau can you please work together on this? Please let me know if it would be helpful to set up a time to work on this further. THANK YOU!

@clarisse-lau
Contributor

clarisse-lau commented May 8, 2024

Thank you @aclayton555 and @aditigopalan! (small correction: the mapping file is in the ncihtan/cds_dbgap repo)
I think it would be helpful to set up a call to talk through things. I have a couple of pending meetings, but this is my current availability this week:

Wednesday 10:30-11:30am, 12:30-1:30 PT
Thursday 9am or 12pm PT
Friday 8am, 10am-12pm PT

@aditigopalan
Contributor

Sorry, I missed this! Are you still available 10:30am PT tomorrow? @clarisse-lau

@clarisse-lau
Contributor

No worries! 10:30 tomorrow works

@aditigopalan
Contributor

Just sent you an invite!

@clarisse-lau
Contributor

A thought on this... Component is one of the source attributes used in the CDS mapping file (to map library_strategy and library_source values; see the CDS template).

As we cannot have Component twice in a template, we could instead include the Data Type attribute, which should provide sufficient information to map to the above fields.
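A rough sketch of how Data Type could stand in for Component when deriving the two CDS fields. The field names library_strategy and library_source come from this thread; every Data Type value and mapped value in the table below is a hypothetical placeholder, and the real mapping lives in the mapping file.

```python
# Hypothetical lookup table: these Data Type values and CDS values are
# placeholders for illustration, not entries from the actual mapping file.
DATA_TYPE_TO_CDS = {
    "Bulk ATAC-seq": {"library_strategy": "ATAC-seq", "library_source": "Genomic"},
    "Bulk RNA-seq": {"library_strategy": "RNA-Seq", "library_source": "Transcriptomic"},
}


def add_cds_library_fields(row):
    """Return a copy of a manifest row with the two CDS library fields added."""
    data_type = row["Data Type"]
    if data_type not in DATA_TYPE_TO_CDS:
        # Fail loudly so unmapped data types are added to the table, not guessed.
        raise ValueError(f"No CDS mapping for Data Type: {data_type!r}")
    return {**row, **DATA_TYPE_TO_CDS[data_type]}
```

Raising on an unmapped value keeps gaps in the mapping visible instead of silently producing empty CDS fields.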

@aditigopalan
Contributor

Here is the template for now, I replaced one of the components with "Data Type"

output.csv

Also here are the attributes: "Last Known Disease Status, Primary Diagnosis, Fixative Type, Treatment Outcome, SizeX, age_at_diagnosis_years, Genomic Reference, Race, NominalMagnification, Days to Recurrence, Morphology, Filename, SizeY, pi_last, Library Selection Method, Days to Last Known Disease Status, file_url_in_cds, pi_first, HTAN_Center, Biospecimen Type, Sequencing Platform, Tseries, Ethnicity, Tissue or Organ of Origin, File Format, SizeZ, PhysicalSizeY, channel_metadata_url, Microscope, Days to Last Follow up, HTAN Data File ID, LensNA, PhysicalSizeX, HTAN Participant ID, Library Layout, Vital Status, Software and Version, Treatment Type, Tumor Tissue Type, Objective, Pyramid, HTAN Biospecimen ID, SizeT, SizeC, Protocol Link, pi_email, Imaging Assay Type, Component, cancer_type, Site of Resection or Biopsy, WorkingDistance, md5, Immersion, Gender, File_Size, Zstack, Progression or Recurrence, Tumor Grade"

Should we rearrange the fields for clarity? Would it also help to have a definition of some fields (e.g. SizeZ), or would the users be familiar with these names?

@clarisse-lau let me know what you think!

@clarisse-lau
Contributor

clarisse-lau commented May 15, 2024

Thank you @aditigopalan !

  • As this template is intended to be specific to sequencing data, we can remove imaging-related attributes. The genomics mapping goes up to Row 940 of the mapping file, and is followed by mappings for various imaging metadata tables which don't need to be pulled into this CDS template.
  • The attributes included in the CDS template can be subset to only those found in the HTAN Data Model. There are a few 'source' attributes included (e.g. HTAN_Center, file_size, file_url_in_cds) that are not actually part of the HTAN data model, but are instead added at various points in the data flow process (either added to BigQuery tables or as part of the CDS manifest preparation script). Sorry for the confusion.

These changes should simplify the CDS template quite a bit, and as it would only include existing data model elements, users will have access to definitions for each field from the HTAN data model.
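As a sketch of the subsetting described above, assuming the set of HTAN data model attribute names is available as a plain collection. The `subset_attributes` helper and its `exclude` parameter are hypothetical; the three downstream-added names are the ones called out in this comment, compared case-insensitively because the draft list spells one of them `File_Size`.

```python
# Attributes named in this thread as added downstream of the data model
# (BigQuery tables or the CDS manifest preparation script).
ADDED_DOWNSTREAM = {"htan_center", "file_size", "file_url_in_cds"}


def subset_attributes(attributes, model_attributes, exclude=()):
    """Keep attributes that exist in the data model, in their original order.

    `exclude` is a hypothetical hook for dropping further names, such as
    imaging-only attributes, and is matched case-insensitively.
    """
    model = {name.lower() for name in model_attributes}
    dropped = ADDED_DOWNSTREAM | {name.lower() for name in exclude}
    return [
        name
        for name in attributes
        if name.lower() in model and name.lower() not in dropped
    ]
```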

Some rearranging to align with HTAN template conventions can be done at the implementation stage (i.e. using DependsOn in the data model); keeping it as an unordered csv for now is fine. Typically HTAN manifests start with the following attributes in this order: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, followed by all other metadata attributes.
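For the interim csv, the conventional column order could be applied with something like the sketch below. The five leading names are taken from this comment; `order_columns` is a hypothetical helper, and the real ordering would come from DependsOn in the data model.

```python
# Leading columns in the order HTAN manifests typically use (from this thread).
LEADING_COLUMNS = [
    "Component",
    "Filename",
    "File Format",
    "HTAN Parent Biospecimen ID",
    "HTAN Data File ID",
]


def order_columns(columns):
    """Put the conventional leading columns first, keep the rest in input order."""
    unique = list(dict.fromkeys(columns))  # drop duplicates, preserve order
    head = [name for name in LEADING_COLUMNS if name in unique]
    tail = [name for name in unique if name not in LEADING_COLUMNS]
    return head + tail
```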

@clarisse-lau
Contributor

Just had a chat with Ashley & Adam. We'd like to subset the attribute list even further to include only sequencing attributes (plus the descriptor columns: Component, Filename, File Format, HTAN Parent Biospecimen ID, HTAN Data File ID, Data Type).

Clinical/biospecimen fields will be annotated separately by the center and pulled in from those templates respectively (as is currently done in the metadata generation scripts).

@aclayton555
Contributor Author

@aditigopalan just checking in on this, and whether there is anything you need the team to review at this stage. We are aiming to have this implemented and available for the Stanford center to test by the close-out of our 24-5 sprint.

@aditigopalan
Contributor

Thanks for checking in! Please let me know if this needs to be subsetted further @aclayton555 @adamjtaylor
output.csv

@adamjtaylor
Contributor

Thanks @aditigopalan! I think we only need those attributes that actually come from the sequencing technology, as the others will come from our Biospecimen and Clinical elements. So let's drop those and keep:

  • Genomic Reference
  • Library Layout
  • Data Type
  • Sequencing Platform
  • Library Selection Method

Plus the minimal HTAN columns for a component:

  • HTAN Data File ID
  • HTAN Parent Biospecimen ID
  • Filename
  • File Format

@adamjtaylor
Contributor

@aditigopalan if you can open a draft PR and link to this issue that would be useful. Thank you!

@aclayton555
Contributor Author

Add "CDS" prefix to all attributes for this template
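A sketch of what the prefixing might look like when building the template header. The attribute names are the ones agreed earlier in this thread; the exact prefix format ("CDS" followed by a space), the `prefix_attributes` helper, and the option to leave some names unprefixed are all assumptions, since the thread does not spell them out.

```python
# Attribute names agreed earlier in this thread for the sequencing template.
TEMPLATE_ATTRIBUTES = [
    "HTAN Data File ID",
    "HTAN Parent Biospecimen ID",
    "Filename",
    "File Format",
    "Genomic Reference",
    "Library Layout",
    "Data Type",
    "Sequencing Platform",
    "Library Selection Method",
]


def prefix_attributes(names, prefix="CDS", keep_unprefixed=()):
    """Prefix each name with `prefix` and a space (format is an assumption).

    Names listed in `keep_unprefixed` are returned unchanged, and names that
    already carry the prefix are not prefixed twice.
    """
    lead = f"{prefix} "
    result = []
    for name in names:
        if name in keep_unprefixed or name.startswith(lead):
            result.append(name)
        else:
            result.append(lead + name)
    return result
```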

@adamjtaylor
Contributor

Merged! @aditigopalan if you could generate the template using schematic or staging DCA and report if it looks sensible that would be great.

@aditigopalan
Contributor

@adamjtaylor tested using dca-staging! Looks alright to me.

@aclayton555
Contributor Author

AMAZING (and radical) COLLABORATION ON THIS!

@adamjtaylor
Contributor

Not quite out of the woods yet! @aditigopalan is chasing down an errant loop in the DAG that is dragging some extra attributes into the template.
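One generic way to hunt for a loop like this is a depth-first search over the attribute dependency graph. The sketch below assumes the dependencies have already been exported as a plain dict of attribute name to the names it depends on; that export, the `find_cycle` name, and the example graph are hypothetical, and this is not how schematic itself inspects the model.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated at the end),
    or None if the dependency graph has no cycle."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for neighbor in graph.get(node, ()):
            neighbor_state = state.get(neighbor, UNVISITED)
            if neighbor_state == IN_PROGRESS:
                # Back edge: the neighbor is already on the current path.
                return path[path.index(neighbor):] + [neighbor]
            if neighbor_state == UNVISITED:
                found = visit(neighbor)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(graph):
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None
```

For example, `find_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})` returns `["A", "B", "C", "A"]`, which points directly at the edges to inspect.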


4 participants