Add vignette for new command line merge tool
- new CLI merge tool to add new clinical
  data to existing ADAT file
- closes #45
stufield committed Apr 21, 2023
1 parent 658edac commit d844898
Showing 3 changed files with 247 additions and 0 deletions.
3 changes: 3 additions & 0 deletions NEWS.md
@@ -5,6 +5,9 @@
simplifying the sample analyses into independent
vignettes (#35)

* New vignette on how to use the CLI merge tool
to merge clinical data into existing ADAT files (#45)


# SomaDataIO 6.0.0 :tada:

8 changes: 8 additions & 0 deletions _pkgdown.yml
@@ -60,6 +60,14 @@ articles:
  contents:
  - linear-regression

- title: Command Line Merge Tool
  navbar: ~
  desc: >
    A convenient CLI merge tool to add new clinical data
    to 'SomaScan' data comes with 'SomaDataIO'.
  contents:
  - cli-merge-tool

reference:
- title: Load an ADAT
  desc: >
236 changes: 236 additions & 0 deletions vignettes/cli-merge-tool.Rmd
@@ -0,0 +1,236 @@
---
title: "Command Line Merge Tool"
author: "Stu Field, SomaLogic Operating Co., Inc."
output:
  rmarkdown::html_vignette:
    fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Command Line Merge Tool}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
library(SomaDataIO)
library(withr)
Sys.setlocale("LC_COLLATE", "en_US.UTF-8")
knitr::opts_chunk$set(
echo = TRUE,
collapse = TRUE,
comment = "#>"
)
```


# Overview

Occasionally, additional clinical data is obtained _after_ samples have been
submitted to SomaLogic, Inc. or even after 'SomaScan' results have
been delivered.
This requires the new clinical, i.e. non-proteomic, data to be merged
with the 'SomaScan' data into a "new" ADAT prior to analysis.
For this purpose, a command-line interface (CLI) tool is included
with `SomaDataIO` and lives in the `cli/merge/` directory of the
installed package.


----------------


## Setup

The clinical merge tool is an `R` script that comes with an installation
of `SomaDataIO`:

```{r merge-script}
dir(system.file("cli", "merge", package = "SomaDataIO"))
f <- system.file("cli", "merge", "merge_clin.R", package = "SomaDataIO")
f
```

First create a temporary "analysis" directory:

```{r create-dir}
analysis_dir <- tempfile(pattern = "somascan-")
dir.create(analysis_dir)
dir.exists(analysis_dir)
file.copy(f, to = analysis_dir)
```

Let's create some dummy 'SomaScan' data derived from the `example_data` object
from `SomaDataIO`. First we reduce its size to 5 features and 9 samples, and
then write to text with `write_adat()`.
This will be the "new" starting point for the clinical
data merge and represents where customers would begin their analysis.

```{r save-data}
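# randomly select 5 analytes (seeded for reproducibility)
feats <- withr::with_seed(3, sample(getAnalytes(example_data), 5))
# keep the key meta data plus the 5 features; first 9 samples only
sub_adat <- dplyr::select(example_data, PlateId, SlideId, Subarray,
                          SampleId, Age, all_of(feats)) |> head(9L)
# write the reduced ADAT into the analysis directory
withr::with_dir(analysis_dir,
  write_adat(sub_adat, file = "ex-data-small9.adat")
)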
```

Now we create some random clinical data with a common key (this is typically
the `SampleId` identifier but it could be any common key).
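
Before merging, it can help to confirm how well the keys in the two files
overlap. The following is a minimal sketch using toy key vectors; `adat_ids`
and `clin_ids` are hypothetical stand-ins for the key columns of your own
ADAT and CSV files, not objects created by `SomaDataIO`. Coercing both to
character avoids mismatches between numeric and character keys.

```r
# Hypothetical key values; replace with the key columns of your own files
adat_ids <- as.character(1:9)                  # keys found in the ADAT
clin_ids <- as.character(seq(1, 9, by = 2))    # keys found in the CSV

matched   <- intersect(adat_ids, clin_ids)  # keys present in both files
adat_only <- setdiff(adat_ids, clin_ids)    # samples with no clinical data
clin_only <- setdiff(clin_ids, adat_ids)    # clinical rows with no sample

# a key should identify at most one clinical row; 0 means no duplicates
n_dup <- anyDuplicated(clin_ids)

list(matched = matched, adat_only = adat_only,
     clin_only = clin_only, n_dup = n_dup)
```

Samples listed in `adat_only` have no clinical record to merge, so it is
worth reviewing them before running the merge.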

```{r create-clin-meta}
df <- data.frame(SampleId = as.character(seq(1, 9, by = 2)),
                 group    = c("a", "b", "a", "b", "a"),
                 newvar   = withr::with_seed(1, rnorm(5)))
df
withr::with_dir(analysis_dir,
  write.csv(df, file = "clin-data.csv", row.names = FALSE)
)
```


At this point we have 3 files in a temporary analysis directory:

```{r ls1}
dir(analysis_dir)
```

- `clin-data.csv` containing 3 columns:
    + a common key: `SampleId`
    + a new variable with grouping information: `group`
    + a new continuous variable: `newvar`
- `ex-data-small9.adat` for 9 samples containing 5 'SomaScan' features,
  and some pre-existing variables we would like to merge into:
    + `PlateId`, `SlideId`, `Subarray`, `SampleId`, and `Age`
    + __note:__ `PlateId`, `SlideId`, and `Subarray` are key fields common
      to almost all ADATs; removing them can yield unintended results
    + the common key `SampleId` is required
- `merge_clin.R`: the merge script itself


## Merging Clinical Data

The clinical data merge tool is simple to use from most common command line
shells (`BASH`, `ZSH`, etc.). You must have `R` installed
(and available on your `PATH`), along with `SomaDataIO` and its dependencies.
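
Both prerequisites can be checked from an `R` session. This is a small
sketch using only base `R`; the names `has_rscript` and `has_pkg` are
illustrative. Note that `requireNamespace()` loads the package namespace
as a side effect when the package is installed.

```r
# TRUE when an `Rscript` executable can be found on the PATH
has_rscript <- unname(nzchar(Sys.which("Rscript")))

# TRUE when SomaDataIO is installed and its namespace can be loaded
has_pkg <- requireNamespace("SomaDataIO", quietly = TRUE)

c(Rscript = has_rscript, SomaDataIO = has_pkg)
```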

### Arguments

The merge script takes 4 arguments:

1. path to the ADAT (`*.adat`) file
1. path to clinical data (`*.csv`) file
1. common key variable name (`SampleId`)
1. output file name (`*.adat`)
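
It can be useful to sanity-check the file arguments before calling the
script. The helper below, `check_merge_args()`, is a hypothetical
convenience function and is not part of `SomaDataIO` or `merge_clin.R`;
it is a sketch that assumes only the file conventions listed above
(requires `R` >= 4.0 for named `stopifnot()` conditions).

```r
# Hypothetical pre-flight check for arguments 1, 2, and 4
check_merge_args <- function(adat, csv, out) {
  stopifnot(
    "ADAT file not found"           = file.exists(adat),
    "clinical CSV file not found"   = file.exists(csv),
    "input must be a *.adat file"   = tolower(tools::file_ext(adat)) == "adat",
    "clinical data must be a *.csv" = tolower(tools::file_ext(csv)) == "csv",
    "output must be a *.adat file"  = tolower(tools::file_ext(out)) == "adat",
    "output would overwrite input"  =
      normalizePath(adat) != normalizePath(out, mustWork = FALSE)
  )
  invisible(TRUE)
}

# Usage with empty placeholder files in a temporary directory
tmp  <- tempfile(pattern = "preflight-")
dir.create(tmp)
adat <- file.path(tmp, "in.adat")
csv  <- file.path(tmp, "clin.csv")
file.create(adat, csv)

ok  <- check_merge_args(adat, csv, file.path(tmp, "out.adat"))
# re-using the input path as the output is rejected
bad <- try(check_merge_args(adat, csv, adat), silent = TRUE)

unlink(tmp, recursive = TRUE, force = TRUE)
```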



---------------


### Standard Syntax

The primary syntax applies when the common key has the _same_ variable
name in __both__ files, ADAT and CSV:


```bash
# change directory to the analysis path
cd `r analysis_dir`

# run the Rscript
Rscript --vanilla merge_clin.R ex-data-small9.adat clin-data.csv SampleId ex-data-small9-merged.adat
```

```{r syscall1, include = FALSE}
withr::with_dir(analysis_dir,
base::system2(
"Rscript",
c("--vanilla", "merge_clin.R", "ex-data-small9.adat", "clin-data.csv",
"SampleId", "ex-data-small9-merged.adat")
)
)
```

```{r ls2}
dir(analysis_dir)
```


### Alternative Syntax

In some cases the common key has a _different_ variable name in each file.
This is handled by a modification to argument 3, which then takes the form
`key1=key2`, where `key1` is the name of the key variable in the `*.adat`
file and `key2` is the name of the key variable in the `*.csv` file.
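
To make the two forms of argument 3 concrete, here is a minimal sketch of
how such a string could be split into its two key names. The function
`parse_key()` is hypothetical and written for illustration only; it is not
taken from `merge_clin.R`, whose internal parsing may differ.

```r
# Hypothetical parser for argument 3: "key" or "key1=key2"
parse_key <- function(arg) {
  parts <- strsplit(arg, "=", fixed = TRUE)[[1L]]
  if (length(parts) == 1L) {
    parts <- rep(parts, 2L)   # same variable name in both files
  }
  stopifnot(length(parts) == 2L, all(nzchar(parts)))
  stats::setNames(parts, c("adat", "csv"))
}

same <- parse_key("SampleId")          # standard syntax
diff <- parse_key("SampleId=ClinID")   # alternative syntax
same
diff
```

The first element names the key in the ADAT and the second names the key
in the CSV, matching the left-to-right order of `key1=key2`.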

First let's create a new clinical data file with a different
variable name, `ClinID`:

```{r create-clin2}
names(df) <- c("ClinID", "letter", "size") # rename original `df`
df
withr::with_dir(analysis_dir,
write.csv(df, file = "clin-data2.csv", row.names = FALSE)
)
```

We can now execute the merge script at the command line:

```bash
Rscript --vanilla merge_clin.R ex-data-small9.adat clin-data2.csv SampleId=ClinID ex-data-small9-merged2.adat
```

```{r syscall2, include = FALSE}
withr::with_dir(analysis_dir,
base::system2(
"Rscript",
c("--vanilla", "merge_clin.R", "ex-data-small9.adat", "clin-data2.csv",
"SampleId=ClinID", "ex-data-small9-merged2.adat")
)
)
```

```{r ls3}
dir(analysis_dir)
```

## Check Results

Now let's check that the merge was successful and yields the expected
`*.adat`:

```{r new-adat}
new <- withr::with_dir(analysis_dir,
read_adat("ex-data-small9-merged2.adat")
)
new
getMeta(new)
getAnalytes(new)
```
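
When reviewing the result, it helps to know what to expect for samples that
have no matching clinical record. The sketch below uses toy data frames and
base `R` only, and it assumes the merge behaves like a left join on the key
(every ADAT row is kept and unmatched rows receive `NA`); that behavior is an
assumption here, so confirm it against your own merged file. The objects
`adat_meta`, `clin`, and `merged` are hypothetical.

```r
# Toy stand-ins for the ADAT meta data and the clinical CSV
adat_meta <- data.frame(SampleId = as.character(1:9))
clin <- data.frame(ClinID = as.character(seq(1, 9, by = 2)),
                   letter = c("a", "b", "a", "b", "a"))

# Look up each ADAT key in the clinical keys; NA where there is no match
idx <- match(adat_meta$SampleId, clin$ClinID)

# Left-join style result: row count and row order of the ADAT are preserved
merged <- cbind(adat_meta, clin[idx, "letter", drop = FALSE])
rownames(merged) <- NULL
merged
```

Under this assumption the even-numbered samples carry `NA` in `letter`,
while the number and order of rows match the original ADAT.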


## Closing

Merging newly obtained clinical variables into existing 'SomaScan' ADATs
is easy via the `merge_clin.R` script provided with `SomaDataIO`.
If you run into any trouble please do not hesitate to reach out
to <[email protected]> or
[file an issue](https://github.com/SomaLogic/SomaDataIO/issues/new) on
our [GitHub](https://github.com/SomaLogic/SomaDataIO) repository.


```{r teardown, include = FALSE}
if ( dir.exists(analysis_dir) ) {
  # `recursive = TRUE` is required for unlink() to delete a directory
  unlink(analysis_dir, recursive = TRUE, force = TRUE)
}
```

---------------------


Created by [Rmarkdown](https://github.com/rstudio/rmarkdown)
(v`r utils::packageVersion("rmarkdown")`) and `r R.version$version.string`.
