This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

Representing Sequencing Information and Genomic Data Files #23

Open
allisonheath opened this issue Sep 1, 2020 · 4 comments
Labels
Kids First DRC Model: New request A new request has been submitted for a NCPI FHIR model change

Comments

@allisonheath
Member

Requester information

Please provide the following information:

  • Name: Allison Heath
  • Affiliations: CHOP/KFDRC

Request Details

Please provide the following information about what you want to accomplish with your model change request:

  • Purpose: Determine if there is a good representation in FHIR for Genomic Data Files or if model changes may be needed. Could potentially split this
  • Who it benefits: All platforms that want to link clinical data to the genomic data files
  • Use case: Currently we link genomic data files back via a representation of the sequencing experiment that generated them, which is in turn linked to the aliquot that was sequenced (see Representing Biospecimens #22). Below are examples of some of the fields we're currently storing; however, we recognize that this structure is less than ideal. Some of these attributes should likely shift around because they are metrics on the resulting genomic files rather than part of the sequencing experiment itself. We also have a list of QC metrics and other experimental information we've wanted to include in our current model. A key one to add is the DRS URI/ID. Only a handful of these fields are typically used for searching (experiment strategy, for example); most of the others become important once you've found the files, either for analysis or to further evaluate their utility.
| entity | property |
| --- | --- |
| genomic_file | acl |
| genomic_file | availability |
| genomic_file | controlled_access |
| genomic_file | data_type |
| genomic_file | external_id |
| genomic_file | file_format |
| genomic_file | file_name |
| genomic_file | hashes |
| genomic_file | is_harmonized |
| genomic_file | kf_id |
| genomic_file | paired_end |
| genomic_file | reference_genome |
| genomic_file | size |
| genomic_file | urls |
| genomic_file | visible |
| sequencing_center | external_id |
| sequencing_center | name |
| sequencing_center | kf_id |
| sequencing_center | visible |
| sequencing_experiment | experiment_date |
| sequencing_experiment | experiment_strategy |
| sequencing_experiment | external_id |
| sequencing_experiment | instrument_model |
| sequencing_experiment | is_paired_end |
| sequencing_experiment | kf_id |
| sequencing_experiment | library_name |
| sequencing_experiment | library_prep |
| sequencing_experiment | library_selection |
| sequencing_experiment | library_strand |
| sequencing_experiment | max_insert_size |
| sequencing_experiment | mean_depth |
| sequencing_experiment | mean_insert_size |
| sequencing_experiment | mean_read_length |
| sequencing_experiment | platform |
| sequencing_experiment | sequencing_center_id |
| sequencing_experiment | total_reads |
| sequencing_experiment | visible |

We have done a few iterations on this. @liberaliscomputing, could you provide a bit more detail on the latest version we're going to try?

cc @nicholasvk @youngnm @baileyckelly

@allisonheath allisonheath added the Model: New request A new request has been submitted for a NCPI FHIR model change label Sep 1, 2020
@liberaliscomputing
Member

liberaliscomputing commented Sep 15, 2020

The KF FHIR team has curated modeling discussions on KFDRC FHIR Model Mappings.

Through a series of extensive discussions, we decided to model the above entities and properties into two components, kfdrc-genomic-file using DocumentReference and kfdrc-sequencing-experiment using Task. sequencing_center can directly be mapped to Organization without needing the creation of a new profile, so we didn't include it in modeling.

1. kfdrc-genomic-file

We decided to use DocumentReference as a base profile based on other initiatives' effort in this area:

Then, from the above properties regarding genomic_file, we excluded is_harmonized, paired_end, and reference_genome because these properties are conceptually not file metadata, but output dimensions of genomic sequencing.

On top of this, we decided to add an extension called accession-identifier because, in KFDRC, we control file accession based on various levels of user authorization.

Our modeling effort as part of the software development cycle has been curated here:

While working on modeling kfdrc-genomic-file, we found the following issues (described in the issue in detail):

  • Size
    • Problem: The data type of DocumentReference.content.attachment.size is unsignedInt, which is limited to the range 0 to 2,147,483,647 (just under 2 GiB when counting bytes). KF genomic_files usually exceed this limit.
    • Intermediate solution: Add a new extension called large-size where the type of data is decimal which doesn't have a range limit. This extension will be bound to Attachment.
  • Data type / file format
    • Problem: Identify canonical sets of data type and file format codes.
    • Intermediate solution: Create CodeSystem-data-type and CodeSystem-file-format and bind these to ValueSet-data-type and ValueSet-file-format respectively. Finally, bind these ValueSets to DocumentReference.type and DocumentReference.content.format respectively.

The following is an example resource:

{
  "resourceType": "DocumentReference",
  "id": "gf-001",
  "meta": {
    "profile": [
      "http://fhir.kids-first.io/StructureDefinition/kfdrc-genomic-file"
    ],
    "versionId": "0.1.0"
  },
  "identifier": [
    {
      "system": "https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?study_id=SD_PREASA7S",
      "value": "kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam"
    }
  ],
  "extension": [
    {
      "extension": [
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "phs001138.c1"
          }
        },
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "SD_PREASA7S"
          }
        }
      ],
      "url": "http://fhir.kids-first.io/StructureDefinition/accession-identifier"
    }
  ],
  "status": "current",
  "type": {
    "coding": [
      {
        "system": "http://fhir.kids-first.io/CodeSystem/data-type",
        "code": "C164052",
        "display": "Aligned Sequence Read"
      }
    ],
    "text": "Aligned Reads"
  },
  "subject": {
    "reference": "Patient/pt-001"
  },
  "content": [
    {
      "attachment": {
        "extension": [
          {
            "url": "http://fhir.kids-first.io/StructureDefinition/large-size",
            "valueDecimal": 72605537636
          }
        ],
        "url": "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
        "title": "HMNVCCCXX-7.hgv.bam"
      },
      "format": {
        "display": "bam"
      }
    }
  ]
}
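
The size workaround described above could be sketched roughly as follows in Python. This is only an illustration, not part of any KF library: `build_attachment` and the `s3://example-bucket/small.txt` URL are hypothetical, while the large-size extension URL and the 72605537636-byte BAM file are taken from the example resource. It assumes the standard `size` element is used whenever the value fits in an unsignedInt and the extension only on overflow.

```python
# FHIR R4 limits unsignedInt to the range 0..2,147,483,647.
UNSIGNED_INT_MAX = 2_147_483_647
# Extension URL as it appears in the example resource in this thread.
LARGE_SIZE_URL = "http://fhir.kids-first.io/StructureDefinition/large-size"


def build_attachment(url: str, title: str, size_bytes: int) -> dict:
    """Return an Attachment dict, using the large-size extension on overflow.

    Hypothetical helper for illustration only.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    attachment = {"url": url, "title": title}
    if size_bytes <= UNSIGNED_INT_MAX:
        # Fits in the standard Attachment.size element.
        attachment["size"] = size_bytes
    else:
        # Too large for unsignedInt: carry the size in the extension instead.
        attachment["extension"] = [
            {"url": LARGE_SIZE_URL, "valueDecimal": size_bytes}
        ]
    return attachment


big = build_attachment(
    "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
    "HMNVCCCXX-7.hgv.bam",
    72605537636,
)
small = build_attachment("s3://example-bucket/small.txt", "small.txt", 1024)
```

Here `big` carries only the extension, matching the example resource, while `small` uses the plain `size` element.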

@liberaliscomputing
Member

liberaliscomputing commented Sep 15, 2020

2. kfdrc-sequencing-experiment

HL7 has made a concerted effort to bring in genomics, largely using DiagnosticReport, MolecularSequence, and Observation. The main differences between HL7's genomics implementation and KF's sequencing_experiment include:

  • The HL7 implementation focuses on downstream, specialized types of reporting such as specific sequences, gene mutations, variants, etc.
  • In D3b, another unit, BIXU, is in charge of a variety of genomics reporting, including but not limited to the above. What we actually capture with sequencing_experiment, however, is the processes of 1) sequencing specimens to yield "source" genomic_files (by sequencing_centers) and 2) aligning these source genomic_files against specific reference_genomes to yield "harmonized" genomic_files (by BIXU).

Against this backdrop, our initial effort in modeling kfdrc-sequencing-experiment focuses on the above-explained "processes."

Given the above properties of sequencing_experiment, we characterized them into three dimensions:

  • Information about a sequencing event;
  • Sequencing inputs; and
  • Sequencing outputs

"Information about a sequencing event" is a set of metadata such as experiment date (Task.authoredOn) and performer (Task.owner). We discuss sequencing inputs and outputs per process (i.e. source / harmonized) in detail below.

Our modeling effort as part of the software development cycle has been curated here:

2.1 Task as partOf Task

In brief, the KF DRC currently follows this process:

  1. Register biospecimens;
  2. Register source sequencing_experiments given sequencing manifests from sequencing_centers;
  3. Register source genomic_files uploaded to our S3 by sequencing_centers;
  4. Link biospecimens and source genomic_files; and
  5. Link source sequencing_experiments and source genomic_files

Once BIXU has delivered harmonized genomic_files:

  1. Register harmonized genomic_files uploaded to our S3 by BIXU;
  2. Link biospecimens and harmonized genomic_files; and
  3. Link source sequencing_experiments and harmonized genomic_files

Technically, the harmonized genomic_files are produced by different sequencing_experiments. We link them to the source sequencing_experiments as illustrated above because the current KF model has no means to bundle source and harmonized sequencing_experiments (had we created separate harmonized sequencing_experiments).

FHIR's Task addresses this issue well because a Task can be part of another Task (Task.partOf). We therefore envision three Tasks: one parent Task and two children, one for the source sequencing_experiment and one for the harmonized sequencing_experiment. Please see the "Genomics (workflow)" tab of the KFDRC FHIR ERD, which renders the proposed concept graphically.
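
As a rough sketch of the parent/child arrangement, the Python below builds three minimal Task resources as plain dicts. The ids (`se-parent`, `se-source`, `se-harmonized`), the descriptions, and the `make_task` helper are hypothetical, and the `status` and `intent` values are placeholders chosen only because Task requires both elements. The actual profile may constrain these differently.

```python
from typing import Optional


def make_task(task_id: str, description: str, part_of: Optional[str] = None) -> dict:
    """Build a minimal Task dict; hypothetical helper for illustration."""
    task = {
        "resourceType": "Task",
        "id": task_id,
        "status": "completed",  # placeholder value
        "intent": "order",      # placeholder value
        "description": description,
    }
    if part_of is not None:
        # Task.partOf is a list of references to the enclosing Task(s).
        task["partOf"] = [{"reference": f"Task/{part_of}"}]
    return task


parent = make_task("se-parent", "Sequencing experiment (parent)")
source = make_task("se-source", "Source sequencing", part_of="se-parent")
harmonized = make_task("se-harmonized", "Harmonization", part_of="se-parent")

# Collect the children that point back at the parent Task.
children = [
    t for t in (parent, source, harmonized)
    if {"reference": "Task/se-parent"} in t.get("partOf", [])
]
```

With this layout, both child Tasks can be found from the parent by following `partOf`, which is the bundling the current KF model lacks.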

2.2 Source kfdrc-sequencing-experiment

The following shows our KF entities / properties >> FHIR attributes mappings:

  1. Inputs (Task.input)
  • biospecimen >> valueReference
  • experiment_strategy >> valueCodeableConcept
  • instrument_model >> valueCodeableConcept
  • is_paired_end >> valueBoolean
  • library_name >> valueString
  • library_prep >> valueCodeableConcept
  • library_selection >> valueCodeableConcept
  • library_strand >> valueCodeableConcept
  • platform >> valueCodeableConcept
  2. Outputs (Task.output)
  • genomic_file >> valueReference
  • is_harmonized >> valueBoolean
  • paired_end >> valueInteger
  • reference_genome >> valueCodeableConcept
  • max_insert_size >> valueQuantity
  • mean_depth >> valueQuantity
  • mean_insert_size >> valueQuantity
  • mean_read_length >> valueQuantity
  • total_reads >> valueQuantity
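
To make the mapping above more concrete, here is a small Python sketch of how a few of these properties might land in Task.input and Task.output. Each entry pairs a `type` with a single value[x], as the base Task resource requires. The `param` and `concept` helpers, the `Specimen/sp-001` reference, and all sample values (`WGS`, `lib-001`, the metric numbers) are made up for illustration. Using free-text `type.text` labels is also an assumption, since the thread does not say how the profile codes these types.

```python
def param(name: str, value_key: str, value) -> dict:
    """One Task.input / Task.output entry: a type label plus one value[x]."""
    return {"type": {"text": name}, value_key: value}


def concept(text: str) -> dict:
    """Minimal CodeableConcept carrying only free text."""
    return {"text": text}


source_task = {
    "resourceType": "Task",
    "id": "se-source",      # hypothetical id
    "status": "completed",  # placeholder value
    "intent": "order",      # placeholder value
    "input": [
        param("biospecimen", "valueReference", {"reference": "Specimen/sp-001"}),
        param("experiment_strategy", "valueCodeableConcept", concept("WGS")),
        param("is_paired_end", "valueBoolean", True),
        param("library_name", "valueString", "lib-001"),
    ],
    "output": [
        param("genomic_file", "valueReference",
              {"reference": "DocumentReference/gf-001"}),
        param("is_harmonized", "valueBoolean", False),
        param("mean_depth", "valueQuantity", {"value": 30.5}),
        param("total_reads", "valueQuantity", {"value": 900000000}),
    ],
}
```

The remaining properties would follow the same pattern with the value[x] type listed for each.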

2.3 Harmonized kfdrc-sequencing-experiment

The following shows our KF entities / properties >> FHIR attributes mappings:

  1. Inputs (Task.input)
  • genomic_file >> valueReference
  2. Outputs (Task.output)
  • genomic_file >> valueReference
  • reference_genome >> valueCodeableConcept

@liberaliscomputing
Member

liberaliscomputing commented Sep 16, 2020

Re: data type / file format for kfdrc-genomic-file. During the standup on 09-16-2020, we decided, for now:

  • Data type: create a new CodeSystem or ValueSet based on NCIt.
  • File format: create neither a CodeSystem nor a ValueSet until we've found an established, self-maintained ontology. Until then, we will simply put KF's existing file format enumerations in DocumentReference.content.format.display.

@bwalsh
Contributor

bwalsh commented Oct 5, 2020

DRS is the GA4GH preferred mechanism to represent file objects: The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID. -- data-repository-service-schemas

FHIR representations of files associated with Study, Subject, and Specimen will be more useful to downstream use cases if they contain DRS attributes.
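
For readers unfamiliar with how a logical DRS ID maps to a retrieval endpoint, the Python sketch below resolves a hostname-based `drs://` URI to the corresponding DRS v1 object URL. The function name `drs_uri_to_url` and the host `drs.example.org` are hypothetical. It handles only the simple hostname-based form and does not attempt compact-identifier-based URIs.

```python
from urllib.parse import urlparse


def drs_uri_to_url(drs_uri: str) -> str:
    """Map drs://<hostname>/<id> to https://<hostname>/ga4gh/drs/v1/objects/<id>."""
    parsed = urlparse(drs_uri)
    if parsed.scheme != "drs" or not parsed.netloc:
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    object_id = parsed.path.lstrip("/")
    if not object_id:
        raise ValueError("DRS URI has no object id")
    return f"https://{parsed.netloc}/ga4gh/drs/v1/objects/{object_id}"


url = drs_uri_to_url("drs://drs.example.org/314159")
```

A client would then issue a GET against the resulting URL to fetch the object's metadata, including its access methods.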

A FSH extension of the original openapi definition:

Profile:        DRSAttachment
Parent:         Attachment
Id:             drs-attachment
Title:          "DRS Attachment"
Description:    "A FHIR Attachment extended with DRS Object attributes."
// https://github.com/ga4gh/data-repository-service-schemas/blob/master/openapi/data_repository_service.swagger.yaml#L190-L304

// adds DRSObject to Attachment
* extension contains DRSObject named drs 0..1

// inline definition of sub-extensions
Extension:  DRSObject
Id: drs-object
Title: "DRS Object"
Description: "The drs object"
* extension contains
    id 1..1 MS and
    name 0..1 and
    self_uri 1..1 MS and
    size 1..1 MS and
    created_time 1..1 MS and
    updated_time 0..1 and
    version 0..1 and
    mime_type 0..1 
    // and DRSChecksum named checksums 1..* MS
    // and DRSAccessMethod named access_methods 1..* MS

* extension[id] ^short = "An identifier unique to this `DrsObject`."
* extension[id].value[x] only string
* extension[name] ^short = "A string that can be used to name a `DrsObject`."
* extension[name].value[x] only string
* extension[self_uri] ^short = "A drs:// URI, as defined in the DRS documentation, that tells clients how to access this object."
* extension[self_uri].value[x] only string
* extension[size] ^short = "For blobs, the blob size in bytes.  For bundles, the cumulative size, in bytes, of items in the `contents` field."
* extension[size].value[x] only integer
* extension[created_time] ^short = "Timestamp of content creation in RFC3339."
* extension[created_time].value[x] only dateTime
* extension[updated_time] ^short = "Timestamp of content update in RFC3339, identical to `created_time` in systems that do not support updates."
* extension[updated_time].value[x] only dateTime
* extension[version] ^short = "A string representing a version. (Some systems may use checksum, a RFC3339 timestamp, or an incrementing version number.)"
* extension[version].value[x] only string
* extension[mime_type] ^short = "A string providing the mime-type of the `DrsObject`."
* extension[mime_type].value[x] only string


Extension:  DRSChecksum
Id: drs-checksum
Title: "DRS Checksum"
Description: "The checksum of the `DrsObject`. At least one checksum must be provided."    
* extension contains
    checksum 1..1 MS and
    type 1..1 MS
* extension[checksum] ^short = "The hex-string encoded checksum for the data."
* extension[checksum].value[x] only string
* extension[type] ^short = "The digest method used to create the checksum."
* extension[type].value[x] only string


Extension:  DRSAccessMethod
Id: drs-access-method
Title: "DRS AccessMethod"
Description: "The list of access methods that can be used to fetch the `DrsObject`."    
* extension contains
    type 1..1 MS and
    access_url 0..1 and
    access_id 0..1 and
    region 0..1
* extension[type] ^short = "Type of the access method."
* extension[type].value[x] only string
* extension[access_url] ^short = "An `AccessURL` that can be used to fetch the actual object bytes."
* extension[access_url].value[x] only string
* extension[access_id] ^short = "An arbitrary string to be passed to the `/access` method to get an `AccessURL`."
* extension[access_id].value[x] only string
* extension[region] ^short = "Name of the region in the cloud service provider that the object belongs to."
* extension[region].value[x] only string


Instance: DRSAttachmentExample
InstanceOf: DRSAttachment
Description: "An example representation of a DRSAttachment"
Usage: #inline
* id = "any-attachment-id"
* contentType = #application/json
* extension[drs].extension[id].valueString = "any-id"
* extension[drs].extension[name].valueString = "any-file-name"
* extension[drs].extension[self_uri].valueString = "drs://url-here"
* extension[drs].extension[created_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[updated_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[size].valueInteger = 12345
* extension[drs].extension[version].valueString = "0.0.0"
* extension[drs].extension[mime_type].valueString = "application/json"
* extension[drs].extension[checksums].extension[checksum].valueString = "abcdef0123456789"
* extension[drs].extension[checksums].extension[type].valueString = "etag"
* extension[drs].extension[access_methods].extension[type].valueString = "s3"
* extension[drs].extension[access_methods].extension[access_url].valueString = "s3://some-url-here"
* extension[drs].extension[access_methods].extension[region].valueString = "us-west"
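
As a companion to the FSH, here is a hedged Python sketch of how a DRS object's scalar fields might be serialized into the nested extension JSON that the DRSObject definition implies. The canonical URL `http://example.org/StructureDefinition/drs-object` is a placeholder, since the thread does not state one, and `drs_object_to_extension` is a hypothetical helper. It covers only the sub-extensions that are not commented out in the definition, so checksums and access methods are left out.

```python
# Placeholder canonical URL; the thread does not give one for DRSObject.
DRS_OBJECT_URL = "http://example.org/StructureDefinition/drs-object"

# Sub-extension name -> value[x] key, following the FSH type constraints.
FIELD_TYPES = {
    "id": "valueString",
    "name": "valueString",
    "self_uri": "valueString",
    "size": "valueInteger",
    "created_time": "valueDateTime",
    "updated_time": "valueDateTime",
    "version": "valueString",
    "mime_type": "valueString",
}
# Fields declared 1..1 in the FSH definition.
REQUIRED = ("id", "self_uri", "size", "created_time")


def drs_object_to_extension(drs_object: dict) -> dict:
    """Convert a flat DRS object dict into a nested FHIR extension dict."""
    missing = [k for k in REQUIRED if k not in drs_object]
    if missing:
        raise ValueError(f"missing required DRS fields: {missing}")
    subs = [
        {"url": name, value_key: drs_object[name]}
        for name, value_key in FIELD_TYPES.items()
        if name in drs_object
    ]
    return {"url": DRS_OBJECT_URL, "extension": subs}


ext = drs_object_to_extension({
    "id": "any-id",
    "self_uri": "drs://url-here",
    "size": 12345,
    "created_time": "1985-04-12T23:20:50.52Z",
})
```

One thing to keep in mind: FHIR's integer type is 32-bit, so a `size` sub-extension typed as integer may run into the same limit discussed earlier in this thread for large genomic files.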

xavanx added a commit to NimbusInformatics/bdcat-fhir-azure-prototype that referenced this issue Oct 27, 2020