Representing Sequencing Information and Genomic Data Files #23

allisonheath · 2020-09-01T14:47:04Z

Requester information

Please provide the following information:

Name: Allison Heath
Affiliations: CHOP/KFDRC

Request Details

Please provide the following information about what you want to accomplish with your model change request:

Purpose: Determine if there is a good representation in FHIR for Genomic Data Files or if model changes may be needed. Could potentially split this
Who it benefits: All platforms that want to link clinical data to the genomic data files
Use case: Currently we link genomic data files back via a representation of the sequencing experiment that generated them, which is in turn linked to the aliquot that was sequenced (see Representing Biospecimens #22). Below are examples of some of the fields we're currently storing; however, we recognize that this structure is less than ideal. Some of these attributes should likely shift around because they're metrics on the resulting genomic files rather than part of the sequencing experiment itself. We also have a list of QC metrics and other experimental information we've wanted to include in our current model. A key one to add is the DRS URI/ID. Only a handful of these fields are typically used for searching (experiment strategy comes to mind); most of the others matter once you've found the files, either for analysis purposes or to further evaluate utility.

entity	property
genomic_file	acl
genomic_file	availability
genomic_file	controlled_access
genomic_file	data_type
genomic_file	external_id
genomic_file	file_format
genomic_file	file_name
genomic_file	hashes
genomic_file	is_harmonized
genomic_file	kf_id
genomic_file	paired_end
genomic_file	reference_genome
genomic_file	size
genomic_file	urls
genomic_file	visible
sequencing_center	external_id
sequencing_center	name
sequencing_center	kf_id
sequencing_center	visible
sequencing_experiment	experiment_date
sequencing_experiment	experiment_strategy
sequencing_experiment	external_id
sequencing_experiment	instrument_model
sequencing_experiment	is_paired_end
sequencing_experiment	kf_id
sequencing_experiment	library_name
sequencing_experiment	library_prep
sequencing_experiment	library_selection
sequencing_experiment	library_strand
sequencing_experiment	max_insert_size
sequencing_experiment	mean_depth
sequencing_experiment	mean_insert_size
sequencing_experiment	mean_read_length
sequencing_experiment	platform
sequencing_experiment	sequencing_center_id
sequencing_experiment	total_reads
sequencing_experiment	visible

We have done a few iterations on this. @liberaliscomputing, could you provide a bit more detail on the latest version we're going to try?

cc @nicholasvk @youngnm @baileyckelly

liberaliscomputing · 2020-09-15T17:02:54Z

The KF FHIR team has curated modeling discussions on KFDRC FHIR Model Mappings.

Through a series of extensive discussions, we decided to model the above entities and properties into two components, kfdrc-genomic-file using DocumentReference and kfdrc-sequencing-experiment using Task. sequencing_center can directly be mapped to Organization without needing the creation of a new profile, so we didn't include it in modeling.

1. `kfdrc-genomic-file`

We decided to use DocumentReference as a base profile based on other initiatives' effort in this area:

Phenopackets' HtsFile: https://aehrc.github.io/fhir-phenopackets-ig/StructureDefinition-HtsFile.html
AnVIL's anvil-document-reference: http://anvil-fhir.s3-website-us-west-2.amazonaws.com/StructureDefinition-anvil-document-reference.html

Then, from the above properties regarding genomic_file, we excluded is_harmonized, paired_end, and reference_genome because these properties are conceptually not file metadata, but output dimensions of genomic sequencing.

On top of this, we decided to add an extension called accession-identifier because, in KFDRC, we control file accession based on various levels of user authorization.

Our modeling effort as part of the software development cycle has been curated here:

Issue: Develop and test Kids First genomic file conformance resources kids-first/kf-model-fhir#187
Pull request (draft): ✨ Add genomic file conformance resources kids-first/kf-model-fhir#191

While working on modeling kfdrc-genomic-file, we found the following issues (described in the issue in detail):

Size
- Problem: The data type of DocumentReference.content.attachment.size is unsignedInt, which ranges between 0 and 2,147,483,647. KF genomic_files are usually larger than this limit.
- Intermediate solution: Add a new extension called large-size whose value type is decimal, which doesn't have a range limit. This extension will be bound to Attachment.
Data type / file format
- Problem: Identify canonical sets of data type and file format codes.
- Intermediate solution: Create CodeSystem-data-type and CodeSystem-file-format and bind these to ValueSet-data-type and ValueSet-file-format respectively. Finally, bind these ValueSets to DocumentReference.type and DocumentReference.content.format respectively.
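The size workaround can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the team's implementation: the helper name `build_attachment` is hypothetical, the limit is the one quoted above, and the extension URL is the one used in the example resource in this comment.

```python
# Upper bound quoted in this thread for DocumentReference.content.attachment.size.
UNSIGNED_INT_MAX = 2_147_483_647

# Extension URL as it appears in the example resource in this comment.
LARGE_SIZE_URL = "http://fhir.kids-first.io/StructureDefinition/large-size"


def build_attachment(url: str, title: str, size_bytes: int) -> dict:
    """Hypothetical helper: build an Attachment dict for a genomic file."""
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    attachment = {"url": url, "title": title}
    if size_bytes <= UNSIGNED_INT_MAX:
        # Small enough for the native size element.
        attachment["size"] = size_bytes
    else:
        # Too large: carry the size in the large-size extension instead.
        attachment["extension"] = [
            {"url": LARGE_SIZE_URL, "valueDecimal": size_bytes}
        ]
    return attachment


bam = build_attachment(
    "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
    "HMNVCCCXX-7.hgv.bam",
    72_605_537_636,
)
small = build_attachment("s3://example-bucket/small.txt", "small.txt", 10)
```

With the values above, `bam` carries its size only in the extension, while `small` (a made-up file) uses the native `size` element.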

The following is an example resource:

{
  "resourceType": "DocumentReference",
  "id": "gf-001",
  "meta": {
    "profile": [
      "http://fhir.kids-first.io/StructureDefinition/kfdrc-genomic-file"
    ],
    "versionId": "0.1.0"
  },
  "identifier": [
    {
      "system": "https://kf-api-dataservice.kidsfirstdrc.org/genomic-files?study_id=SD_PREASA7S",
      "value": "kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam"
    }
  ],
  "extension": [
    {
      "extension": [
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "phs001138.c1"
          }
        },
        {
          "url": "accession",
          "valueIdentifier": {
            "value": "SD_PREASA7S"
          }
        }
      ],
      "url": "http://fhir.kids-first.io/StructureDefinition/accession-identifier"
    }
  ],
  "status": "current",
  "type": {
    "coding": [
      {
        "system": "http://fhir.kids-first.io/CodeSystem/data-type",
        "code": "C164052",
        "display": "Aligned Sequence Read"
      }
    ],
    "text": "Aligned Reads"
  },
  "subject": {
    "reference": "Patient/pt-001"
  },
  "content": [
    {
      "attachment": {
        "extension": [
          {
            "url": "http://fhir.kids-first.io/StructureDefinition/large-size",
            "valueDecimal": 72605537636
          }
        ],
        "url": "s3://kf-seq-data-bcm/seidman/HMNVCCCXX-7.hgv.bam",
        "title": "HMNVCCCXX-7.hgv.bam"
      },
      "format": {
        "display": "bam"
      }
    }
  ]
}

liberaliscomputing · 2020-09-15T18:13:16Z

2. `kfdrc-sequencing-experiment`

HL7 has made a concerted effort to bring in genomics, largely using DiagnosticReport, MolecularSequence, and Observation. The main differences between HL7's genomics implementation and KF's sequencing_experiment include:

The HL7 implementation focuses on downstream, specialized types of reporting such as specific sequences, gene mutations, variants, etc.
In D3b, another unit called BIXU is in charge of a variety of genomics reporting, including but not limited to the above. What we actually capture with sequencing_experiment, however, are the processes of 1) sequencing specimens to yield "source" genomic_files (by sequencing_centers) and 2) aligning these source genomic_files against specific reference_genomes to yield "harmonized" genomic_files (by BIXU).

Against this backdrop, our initial effort in modeling kfdrc-sequencing-experiment focuses on the above-explained "processes."

Given the above properties of sequencing_experiment, we characterized them into three dimensions:

Information about a sequencing event;
Sequencing inputs; and
Sequencing outputs

"Information about a sequencing event" is a set of metadata such as experiment date (Task.authoredOn) and performer (Task.owner). We discuss sequencing inputs and outputs per process (i.e. source / harmonized) in detail below.

Our modeling effort as part of the software development cycle has been curated here:

2.1 `Task` as `partOf` `Task`

In brief, the KF DRC currently follows this process:

Register biospecimens;
Register source sequencing_experiments given sequencing manifests from sequencing_centers;
Register source genomic_files uploaded to our S3 by sequencing_centers;
Link biospecimens and source genomic_files; and
Link source sequencing_experiments and source genomic_files

Once BIXU has delivered harmonized genomic_files:

Register harmonized genomic_files uploaded to our S3 by BIXU;
Link biospecimens and harmonized genomic_files; and
Link source sequencing_experiments and harmonized genomic_files

Technically, the harmonized genomic_files are produced by different sequencing_experiments. We link them to the source sequencing_experiments as illustrated above because the current KF model has no means to bundle source and harmonized sequencing_experiments (had we created separate harmonized sequencing_experiments).

FHIR's Task addresses this issue well because a Task can be part of another Task (Task.partOf). Therefore, we imagine having three Tasks: one parent Task and two children (one for the source sequencing_experiment and the other for the harmonized sequencing_experiment). Please see the "Genomics (workflow)" tab of the KFDRC FHIR ERD, which renders the proposed concept graphically.
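The parent/child arrangement could look roughly like the Python sketch below. It is illustrative only: the helper `make_task`, the resource ids, and the `status` / `intent` values are assumptions chosen to make the example complete, not part of the proposed profile.

```python
from typing import Optional


def make_task(task_id: str, description: str, parent_id: Optional[str] = None) -> dict:
    """Hypothetical helper: build a minimal Task, optionally linked to a parent."""
    task = {
        "resourceType": "Task",
        "id": task_id,
        "status": "completed",  # required by Task; value chosen for illustration
        "intent": "order",      # required by Task; value chosen for illustration
        "description": description,
    }
    if parent_id is not None:
        # Task.partOf is a list of references to the composite Task(s).
        task["partOf"] = [{"reference": f"Task/{parent_id}"}]
    return task


parent = make_task("se-001", "Sequencing experiment (parent)")
source = make_task("se-001-source", "Source sequencing experiment", parent["id"])
harmonized = make_task(
    "se-001-harmonized", "Harmonized sequencing experiment", parent["id"]
)

# Collect the children of the parent Task by inspecting partOf.
children = [
    t["id"]
    for t in (source, harmonized)
    if {"reference": "Task/se-001"} in t.get("partOf", [])
]
```

Here `children` lists both child Tasks, which is the bundling of source and harmonized experiments that the current KF model lacks.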

2.2 Source `kfdrc-sequencing-experiment`

The mappings from KF entities / properties to FHIR attributes are shown below:

Inputs (Task.input)

biospecimen >> valueReference
experiment_strategy >> valueCodeableConcept
instrument_model >> valueCodeableConcept
is_paired_end >> valueBoolean
library_name >> valueString
library_prep >> valueCodeableConcept
library_selection >> valueCodeableConcept
library_strand >> valueCodeableConcept
platform >> valueCodeableConcept

Outputs (Task.output)

genomic_file >> valueReference
is_harmonized >> valueBoolean
paired_end >> valueInteger
reference_genome >> valueCodeableConcept
max_insert_size >> valueQuantity
mean_depth >> valueQuantity
mean_insert_size >> valueQuantity
mean_read_length >> valueQuantity
total_reads >> valueQuantity
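As a rough sketch of how these mappings might be applied, the Python below turns a flat KF record into Task.input / Task.output entries. The helper `to_parameters`, the use of `type.text` to carry the KF property name, and all sample values are assumptions made for illustration; only the property-to-value[x] pairs come from the lists above.

```python
# KF property -> FHIR value[x] key, copied from the mapping lists above.
INPUT_TYPES = {
    "biospecimen": "valueReference",
    "experiment_strategy": "valueCodeableConcept",
    "instrument_model": "valueCodeableConcept",
    "is_paired_end": "valueBoolean",
    "library_name": "valueString",
    "library_prep": "valueCodeableConcept",
    "library_selection": "valueCodeableConcept",
    "library_strand": "valueCodeableConcept",
    "platform": "valueCodeableConcept",
}
OUTPUT_TYPES = {
    "genomic_file": "valueReference",
    "is_harmonized": "valueBoolean",
    "paired_end": "valueInteger",
    "reference_genome": "valueCodeableConcept",
    "max_insert_size": "valueQuantity",
    "mean_depth": "valueQuantity",
    "mean_insert_size": "valueQuantity",
    "mean_read_length": "valueQuantity",
    "total_reads": "valueQuantity",
}


def to_parameters(record: dict, value_types: dict) -> list:
    """Hypothetical helper: wrap each KF property as a Task.input/output entry."""
    params = []
    for name, raw in record.items():
        key = value_types[name]  # a KeyError flags an unmapped property
        if key == "valueReference":
            value = {"reference": raw}
        elif key == "valueCodeableConcept":
            value = {"text": raw}
        elif key == "valueQuantity":
            value = {"value": raw}
        else:
            value = raw  # booleans, strings, integers are used as-is
        params.append({"type": {"text": name}, key: value})
    return params


# Sample values are made up for the example.
task = {
    "resourceType": "Task",
    "id": "se-001-source",
    "status": "completed",
    "intent": "order",
    "input": to_parameters(
        {"biospecimen": "Specimen/sp-001", "experiment_strategy": "WGS",
         "is_paired_end": True},
        INPUT_TYPES,
    ),
    "output": to_parameters(
        {"genomic_file": "DocumentReference/gf-001", "total_reads": 1000000},
        OUTPUT_TYPES,
    ),
}
```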

2.3 Harmonized `kfdrc-sequencing-experiment`

The mappings from KF entities / properties to FHIR attributes are shown below:

Inputs (Task.input)

genomic_file >> valueReference

Outputs (Task.output)

genomic_file >> valueReference
reference_genome >> valueCodeableConcept

liberaliscomputing · 2020-09-16T16:17:00Z

Re: data type / file format for kfdrc-genomic-file. During the standup on 09-16-2020, we decided for now:

Data type: create a new CodeSystem or ValueSet based on NCIt.
File format: create neither a CodeSystem nor a ValueSet until we find an established, self-maintained ontology. In the meantime, we will simply put KF's existing file format enumerations into DocumentReference.content.format.display.

bwalsh · 2020-10-05T17:47:46Z

DRS is the GA4GH preferred mechanism to represent file objects: The Data Repository Service (DRS) API provides a generic interface to data repositories so data consumers, including workflow systems, can access data in a single, standardized way regardless of where it’s stored or how it’s managed. The primary functionality of DRS is to map a logical ID to a means for physically retrieving the data represented by the ID. -- data-repository-service-schemas

FHIR representations of files associated with Study, Subject, and Specimen will be more useful to downstream use cases if they contain DRS attributes.
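To make the "logical ID to retrieval" mapping concrete, here is a small Python sketch of resolving a hostname-based DRS URI to the HTTPS endpoint that returns the object's metadata. It assumes the DRS v1 path convention (`/ga4gh/drs/v1/objects/{object_id}`) and deliberately does not handle compact-identifier-style URIs, which need a registry lookup. The function name and the example host are hypothetical.

```python
from urllib.parse import quote, urlsplit


def drs_uri_to_https(drs_uri: str) -> str:
    """Hypothetical helper: map drs://<hostname>/<id> to a DRS v1 objects URL."""
    parts = urlsplit(drs_uri)
    object_id = parts.path.lstrip("/")
    if parts.scheme != "drs" or not parts.netloc or not object_id:
        # Also rejects compact-identifier URIs such as drs://prefix:accession,
        # which have no path component and are out of scope for this sketch.
        raise ValueError(f"not a hostname-based DRS URI: {drs_uri!r}")
    # Percent-encode the id so it is safe to use as a single path segment.
    return f"https://{parts.netloc}/ga4gh/drs/v1/objects/{quote(object_id, safe='')}"


url = drs_uri_to_https("drs://drs.example.org/314159")
```

A client would then issue a GET against `url` and pick one of the returned access methods to fetch the bytes.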

Below is an FSH (FHIR Shorthand) adaptation of the original OpenAPI definition:

Profile:        DRSAttachment
Parent:         Attachment
Id:             drs-attachment
Title:          "DRS Attachment"
Description:    "A FHIR Attachment extended with DRS Object attributes."
// https://github.com/ga4gh/data-repository-service-schemas/blob/master/openapi/data_repository_service.swagger.yaml#L190-L304

// adds DRSObject to Attachment
* extension contains DRSObject named drs 0..1

// inline definition of sub-extensions
Extension:  DRSObject
Id: drs-object
Title: "DRS Object"
Description: "The drs object"
* extension contains
    id 1..1 MS and
    name 0..1 and
    self_uri 1..1 MS and
    size 1..1 MS and
    created_time 1..1 MS and
    updated_time 0..1 and
    version 0..1 and
    mime_type 0..1 
    // and DRSChecksum named checksums 1..* MS
    // and DRSAccessMethod named access_methods 1..* MS

* extension[id] ^short = "An identifier unique to this `DrsObject`."
* extension[id].value[x] only string
* extension[name] ^short = "A string that can be used to name a `DrsObject`."
* extension[name].value[x] only string
* extension[self_uri] ^short = "A drs:// URI, as defined in the DRS documentation, that tells clients how to access this object."
* extension[self_uri].value[x] only string
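* extension[size] ^short = "For blobs, the blob size in bytes.  For bundles, the cumulative size, in bytes, of items in the `contents` field."
// Note: FHIR integer is 32-bit signed, so sizes above 2,147,483,647 bytes will not fit (same limit as discussed for Attachment.size earlier in this thread).
* extension[size].value[x] only integer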
* extension[created_time] ^short = "Timestamp of content creation in RFC3339."
* extension[created_time].value[x] only dateTime
* extension[updated_time] ^short = "Timestamp of content update in RFC3339, identical to `created_time` in systems that do not support updates."
* extension[updated_time].value[x] only dateTime
* extension[version] ^short = "A string representing a version. (Some systems may use checksum, a RFC3339 timestamp, or an incrementing version number.)"
* extension[version].value[x] only string
* extension[mime_type] ^short = "A string providing the mime-type of the `DrsObject`."
* extension[mime_type].value[x] only string


Extension:  DRSChecksum
Id: drs-checksum
Title: "DRS Checksum"
Description: "The checksum of the `DrsObject`. At least one checksum must be provided."    
* extension contains
    checksum 1..1 MS and
    type 1..1 MS
* extension[checksum] ^short = "The hex-string encoded checksum for the data."
* extension[checksum].value[x] only string
* extension[type] ^short = "The digest method used to create the checksum."
* extension[type].value[x] only string


Extension:  DRSAccessMethod
Id: drs-access-method
Title: "DRS AccessMethod"
Description: "The list of access methods that can be used to fetch the `DrsObject`."    
* extension contains
    type 1..1 MS and
    access_url 0..1 and
    access_id 0..1 and
    region 0..1
* extension[type] ^short = "Type of the access method."
* extension[type].value[x] only string
* extension[access_url] ^short = "An `AccessURL` that can be used to fetch the actual object bytes."
* extension[access_url].value[x] only string
* extension[access_id] ^short = "An arbitrary string to be passed to the `/access` method to get an `AccessURL`."
* extension[access_id].value[x] only string
* extension[region] ^short = "Name of the region in the cloud service provider that the object belongs to."
* extension[region].value[x] only string


Instance: DRSAttachmentExample
InstanceOf: DRSAttachment
Description: "An example representation of a DRSAttachment"
Usage: #inline
* id = "any-attachment-id"
* contentType = #application/json
* extension[drs].extension[id].valueString = "any-id"
* extension[drs].extension[name].valueString = "any-file-name"
* extension[drs].extension[self_uri].valueString = "drs://url-here"
* extension[drs].extension[created_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[updated_time].valueDateTime = "1985-04-12T23:20:50.52Z"
* extension[drs].extension[size].valueInteger = 12345
* extension[drs].extension[version].valueString = "0.0.0"
* extension[drs].extension[mime_type].valueString = "application/json"
* extension[drs].extension[checksums].extension[checksum].valueString = "abcdef0123456789"
* extension[drs].extension[checksums].extension[type].valueString = "etag"
* extension[drs].extension[access_methods].extension[type].valueString = "s3"
* extension[drs].extension[access_methods].extension[access_url].valueString = "s3://some-url-here"
* extension[drs].extension[access_methods].extension[region].valueString = "us-west"
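For readers less familiar with FSH, the Python sketch below shows roughly what the nested `drs` extension would look like as FHIR JSON for the simple (non-commented-out) sub-extensions. The extension URL is a placeholder, since the canonical base for this definition is not stated in the thread, and the helper name `drs_extension` is hypothetical.

```python
# Placeholder URL: the real canonical base is not given in this thread.
DRS_OBJECT_URL = "http://example.org/fhir/StructureDefinition/drs-object"

# Sub-extension name -> value[x] key, following the DRSObject definition above.
VALUE_KEYS = {
    "id": "valueString",
    "name": "valueString",
    "self_uri": "valueString",
    "size": "valueInteger",
    "created_time": "valueDateTime",
    "updated_time": "valueDateTime",
    "version": "valueString",
    "mime_type": "valueString",
}
# Sub-extensions declared 1..1 in the definition above.
REQUIRED = ("id", "self_uri", "size", "created_time")


def drs_extension(drs_object: dict) -> dict:
    """Hypothetical helper: render a DRS object dict as a nested FHIR extension."""
    missing = [name for name in REQUIRED if name not in drs_object]
    if missing:
        raise ValueError(f"missing required DRS attributes: {missing}")
    subs = [
        {"url": name, key: drs_object[name]}
        for name, key in VALUE_KEYS.items()
        if name in drs_object
    ]
    return {"url": DRS_OBJECT_URL, "extension": subs}


attachment = {
    "contentType": "application/json",
    "extension": [
        drs_extension({
            "id": "any-id",
            "self_uri": "drs://url-here",
            "size": 12345,
            "created_time": "1985-04-12T23:20:50.52Z",
        })
    ],
}
```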

Adapted from NIH-NCPI/ncpi-model-forge#23

allisonheath added the "Model: New request" label Sep 1, 2020

allisonheath assigned RobertJCarroll and katiebanaz Sep 1, 2020

allisonheath added the Kids First DRC label Sep 1, 2020

torstees mentioned this issue Sep 16, 2020

Add support for CMG Sequencing table data anvilproject/cmg-data-ingest#1

Open

bwalsh mentioned this issue Oct 14, 2020

✨ Add DRS implementation #42

Merged

allisonheath mentioned this issue Oct 22, 2020

Referencing DRS Objects #46

Closed

xavanx added a commit to NimbusInformatics/bdcat-fhir-azure-prototype that referenced this issue Oct 27, 2020

Create bam_document_reference.json

c851217
