Merge pull request #812 from neuropoly/bep031_sample_entity

[ENH] BEP031 - New entity: sample and samples.tsv file
bids-standard · Jul 26, 2021 · 1323f23
2 parents 275f771 + f02aff2
Showing 4 changed files with 89 additions and 0 deletions.
diff --git a/src/02-common-principles.md b/src/02-common-principles.md
@@ -32,6 +32,13 @@ misunderstanding we clarify them here.
     context, a session may also indicate a group of related scans,
     taken in one or more visits.
 
+1.  **Sample** - a sample pertaining to a subject such as tissue, primary cell
+    or cell-free sample.
+    The `sample-<label>` key/value pair is used to distinguish between different
+    samples from the same subject.
+    The label MUST be unique per subject and is RECOMMENDED to be unique
+    throughout the dataset.
+
 1.  **Data acquisition** - a continuous uninterrupted block of time during which
     a brain scanning instrument was acquiring data according to particular
     scanning sequence/protocol.

diff --git a/src/03-modality-agnostic-files.md b/src/03-modality-agnostic-files.md
@@ -255,6 +255,72 @@ to date of birth.
 }
 ```
 
+## Samples file
+
+Template:
+
+```Text
+samples.tsv
+samples.json
+```
+
+The purpose of this file is to describe properties of samples, indicated by the `sample` entity.
+This file is REQUIRED if `sample-<label>` is present in any file name within the dataset.
+If this file exists, it MUST contain the three following columns:
+
+-   `sample_id`: MUST consist of `sample-<label>` values identifying one row
+    for each sample
+
+-   `participant_id`: MUST consist of `sub-<label>`
+
+-   `sample_type`: MUST consist of sample type values, either `cell line`, `in vitro differentiated cells`,
+    `primary cell`, `cell-free sample`, `cloning host`, `tissue`, `whole organisms`, `organoid` or
+    `technical sample` from [ENCODE Biosample Type](https://www.encodeproject.org/profiles/biosample_type)
+
+Other optional columns MAY be used to describe the samples.
+Each sample MUST be described by one and only one row.
+
+Commonly used *optional* columns in `samples.tsv` files are `pathology` and
+`derived_from`. Using these columns is RECOMMENDED and, if they are used,
+it is RECOMMENDED to use the following values for them:
+
+-   `pathology`: string value describing the pathology of the sample or type of control.
+    When different from `healthy`, pathology SHOULD be specified in `samples.tsv`.
+    The pathology MAY instead be specified in [Sessions files](06-longitudinal-and-multi-site-studies.md#sessions-file)
+    in case it changes over time.
+
+-   `derived_from`: `sample-<label>` key/value pair from which a sample is derived,
+    for example a slice of tissue (`sample-02`) derived from a block of tissue (`sample-01`),
+    as illustrated in the example below.
+
+`samples.tsv` example:
+
+```Text
+sample_id participant_id sample_type derived_from
+sample-01 sub-01 tissue n/a
+sample-02 sub-01 tissue sample-01
+sample-03 sub-01 tissue sample-01
+sample-04 sub-02 tissue n/a
+sample-05 sub-02 tissue n/a
+```
+
+It is RECOMMENDED to accompany each `samples.tsv` file with a sidecar
+`samples.json` file to describe the TSV column names and properties of their values
+(see also the [section on tabular files](02-common-principles.md#tabular-files)).
+
+`samples.json` example:
+
+```JSON
+{
+    "sample_type": {
+        "Description": "type of sample from ENCODE Biosample Type (https://www.encodeproject.org/profiles/biosample_type)"
+    },
+    "derived_from": {
+        "Description": "sample_id from which the sample is derived"
+    }
+}
+```
+
 ## Phenotypic and assessment data
 
 Template:

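Reviewer note: the column rules added in the `03-modality-agnostic-files.md` hunk above can be checked mechanically. The following is a minimal sketch, not part of this PR or of any BIDS validator. The function name `check_samples_tsv` is hypothetical, and the label pattern (one or more alphanumeric characters) is an assumption about `<label>`.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ("sample_id", "participant_id", "sample_type")

# Allowed sample_type values, copied from the text added in this hunk.
SAMPLE_TYPES = {
    "cell line", "in vitro differentiated cells", "primary cell",
    "cell-free sample", "cloning host", "tissue", "whole organisms",
    "organoid", "technical sample",
}

# Assumption: a <label> is one or more alphanumeric characters.
SAMPLE_ID = re.compile(r"sample-[0-9a-zA-Z]+")
PARTICIPANT_ID = re.compile(r"sub-[0-9a-zA-Z]+")


def check_samples_tsv(text):
    """Return a list of problems found in the text of a samples.tsv file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in REQUIRED_COLUMNS if name not in header]
    if problems:
        return problems

    seen = set()
    # The header is line 1, so data rows are counted from line 2.
    for line_no, row in enumerate(reader, start=2):
        sample = row["sample_id"] or ""
        participant = row["participant_id"] or ""
        sample_type = row["sample_type"] or ""
        if not SAMPLE_ID.fullmatch(sample):
            problems.append(f"line {line_no}: bad sample_id {sample!r}")
        if not PARTICIPANT_ID.fullmatch(participant):
            problems.append(f"line {line_no}: bad participant_id {participant!r}")
        if sample_type not in SAMPLE_TYPES:
            problems.append(f"line {line_no}: unknown sample_type {sample_type!r}")
        # Each sample gets exactly one row; labels are unique per subject.
        if (participant, sample) in seen:
            problems.append(f"line {line_no}: duplicate row for {participant} {sample}")
        seen.add((participant, sample))
    return problems
```

Called on the tab-separated text of the `samples.tsv` example in the hunk, this sketch should return an empty list. Optional columns such as `pathology` and `derived_from` are deliberately left unchecked.
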
diff --git a/src/schema/entities.yaml b/src/schema/entities.yaml
@@ -27,6 +27,17 @@ session:
     (for example, training).
   type: string
   format: label
+sample:
+  name: Sample
+  entity: sample
+  description: |
+    A sample pertaining to a subject such as tissue, primary cell
+    or cell-free sample.
+    The `sample-<label>` key/value pair is used to distinguish between different
+    samples from the same subject.
+    The label MUST be unique per subject and is RECOMMENDED to be unique
+    throughout the dataset.
+  format: label
 task:
   name: Task
   entity: task

diff --git a/src/schema/top_level_files.yaml b/src/schema/top_level_files.yaml
@@ -24,3 +24,8 @@ participants:
   extensions:
     - .tsv
     - .json
+samples:
+  required: false
+  extensions:
+    - .tsv
+    - .json
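
Reviewer note: the schema entry marks `samples` as `required: false`, while the specification text makes `samples.tsv` REQUIRED whenever `sample-<label>` appears in a file name. A minimal sketch of that conditional check follows. The function name `samples_tsv_missing`, the example paths, and the alphanumeric label pattern are assumptions for illustration, not taken from this PR.

```python
import re
from pathlib import PurePosixPath

# Matches a `sample-<label>` entity inside a file name.
# Assumption: entities are separated by "_" and labels are alphanumeric.
SAMPLE_ENTITY = re.compile(r"(?:^|_)sample-[0-9a-zA-Z]+(?:_|\.|$)")


def samples_tsv_missing(paths):
    """Return True if a file name uses the sample entity but the dataset
    has no top-level samples.tsv.

    `paths` are POSIX-style paths relative to the dataset root.
    """
    parsed = [PurePosixPath(p) for p in paths]
    uses_sample = any(SAMPLE_ENTITY.search(p.name) for p in parsed)
    has_samples_tsv = any(
        len(p.parts) == 1 and p.name == "samples.tsv" for p in parsed
    )
    return uses_sample and not has_samples_tsv


# Hypothetical file listing, for illustration only.
listing = ["participants.tsv", "sub-01/micr/sub-01_sample-01_TEM.png"]
print(samples_tsv_missing(listing))
```

Only file names are inspected, not directory names, since the rule in the specification text refers to file names.
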