
docs: transfer of text about runtime uploads #30

Merged
merged 10 commits into from
Jan 2, 2024
15 changes: 15 additions & 0 deletions docs/design/images/runtime-login-sequence.puml
@@ -0,0 +1,15 @@
@startuml runtime-login-sequence

title Login and Authentication Sequence of a Registered User

participant "User" as u
participant "Frontend" as f
participant "API" as api
participant "Backend" as db

u -> f ++: Enter login information
f -> api ++: Send login request
api -> db ++: Check permissions level
db --> api --: Send permissions level
api --> u --: Show permission-dependent content
@enduml
185 changes: 185 additions & 0 deletions docs/design/runtime-view.qmd
@@ -0,0 +1,185 @@
---
title: "Runtime View"
---

This section describes the concrete behaviour, interactions, and
pathways that data take within Sprout. "Runtime" in this case refers
to how the software works "in action".

## Login and Authentication

Almost all users will need to log into the Sprout-managed Data
Resources. The steps for logging in and having their permission levels
checked follow the sequence described in the figure below.

![Login and Authentication Sequence of a Registered User.](images/runtime-login-sequence.png)

## Data Input

The overall aim of this section is to describe the general path that
data takes through a Seedcase Data Resource, from input to the final
output. Specifically, these items are described as:

- *Input*: Because we currently focus on health research, the type of
input data and metadata is what is typically generated from health
studies. This could be in the form of, e.g., CSV, Excel, JSON, or
image files.
- *Output*: The final output is the input data stored together as a
single database, or as multiple databases and files explicitly linked
in such a way that they conceptually represent a single database.

### Expected Type of Input Data

Given the (current) focus on health data, as well as the team's
experience with research and health data, we make some assumptions
about the type of data that will be input into Sprout. Health data
tends to consist of specific types of data:

<!---
- **Clinical**: This data is typically collected during patient visits
to doctors. Depending on the country or administrative region, there
will likely already be well-established data processing and storage
pipelines in place. --->
- **Register**: This type of data is highly dependent on the country
or region. Generally, this data is collected for national or
regional administrative purposes, such as recording employment
status, income, address, medication purchases, and diagnoses. Like
routine clinical data, the pipelines in place for processing and
storage of this data are usually very extensive and well established.
<!---
- **Biological sample data**: This type of data is generated from
biological samples, like blood, saliva, semen, hair, or urine. Data
generated from sample analytic techniques often produce large
volumes of data per person. Samples may be generated in larger
established laboratories or in smaller research groups, depending on
what analytic technology is used and how new it is. The
structure and format of the generated data also tends to be highly
variable and depends heavily on the technology used, sometimes
requiring specialized software to process and output. --->
- **Survey or questionnaire**: This type of data is collected based
on a given study's aims and research questions. There are hundreds
of different questionnaires that can have highly specific purposes
and uses for their data. They are also highly variable in the volume
of data collected and in the format of that data, depending on the
survey.

These types of input data are commonly formatted as text (txt) files,
comma-separated value (csv) files, Excel (XLSX) files, and JSON
structures.
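As an illustration of these formats, the sketch below reads the text-based ones into rows using only the Python standard library. The function name is hypothetical, and Sprout's actual ingestion code may differ.

```python
# Illustrative sketch only: reading the common text-based input formats
# listed above into rows of dictionaries. Not Sprout's actual code.
import csv
import json
from pathlib import Path


def read_input(path: Path) -> list[dict]:
    """Read a txt, csv, or JSON input file into a list of row dicts."""
    suffix = path.suffix.lower()
    if suffix in {".csv", ".txt"}:
        with path.open(newline="") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        return json.loads(path.read_text())
    # Excel (.xlsx) files would need a third-party reader, e.g. openpyxl.
    raise ValueError(f"Unsupported input format: {suffix}")
```

Note that CSV values arrive as strings; type checking against a table schema would happen in a later validation step.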

### Expected Flow of Input Data

The data described above tends to fall into two main categories for
data input.

- *Routine or continuous collection*, where data is ingested into
Sprout as soon as it is collected from one "observational
unit"[^1] or very shortly afterwards. Clinical data as well as
survey or questionnaire data would likely fall under this
category.
- *Batch collection*, where data is ingested some time after it was
collected and from multiple observational units. Biological
sample data would fall under this category, since laboratories
usually run several samples at once and input data after internal
quality control checks and machine-specific data processing. While
register-based data does get collected continuously, direct access
to it is only given on a batch basis, usually once every year.
Survey data may also come in batches, depending on the questionnaire
and software used for its collection.

[^1]: Observational unit is the "entity" that the data was collected
from at a given point in time, such as a human participant in a
cohort study or a rat in an animal study at a specific time point.

For sources of data from routine collection with well-established data
input processes, the data input pipeline would likely involve
redirecting these data sources from their generation into Seedcase via
a direct call to the API, so that the data continues on to the backend
and eventual data storage.
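A routine collection source could push each record to the API as it is collected, along the lines of the sketch below. The endpoint path, payload shape, and bearer-token authentication are assumptions for illustration, not Sprout's documented API.

```python
# Hypothetical sketch of routine data input: a collection site pushes
# each observational unit's record straight to the API as soon as it is
# collected. The endpoint path and auth scheme are assumptions.
import json
from urllib import request


def build_upload_request(api_url: str, record: dict, token: str) -> request.Request:
    """Build the POST request a routine collection source would send."""
    return request.Request(
        f"{api_url}/records",
        data=json.dumps(record).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
```

The built request would then be sent with `urllib.request.urlopen`, letting the API pass the data on to the backend and storage.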

Sources of data that don't have well-established data input processes,
such as data from hospitals or medical laboratories, would need to use
the Sprout data batch-input Web Portal. This Portal only accepts data
that is in a pre-defined format (as determined and created by the Data
Management Administrators) that includes documentation and, potentially,
automation scripts describing how to pre-process the data prior to
uploading it.

These uploaded files might be a variety of file types, like `.csv`,
`.xls`, or `.txt`. Only users with the correct permission levels are
allowed to upload data. It will be the Data Access Administrator who
does the initial upload, as that entails setting up table schemas
and allocating space in the raw data file storage. The second way of
getting data into the Data Resource is to have it entered manually by
an authorized Data Contributor.

Once the data is submitted through the Portal, it is sent in an
encrypted, legally-compliant format to a server and stored in the way
defined by the API and common data model.

### Upload Data to Sprout

An approved user, i.e., a Data Access Administrator or a Data
Contributor, will open the login screen in the Web Portal. They will
enter their credentials, which will be transmitted to the API layer.
The API Security layer will check with the list of users and permissions
in the database and confirm that the specific user has permission to
enter data into a specific table (or set of tables) in the database.

Once this check is complete, the frontend will receive permission from
the API Security layer to display the data entry/upload options for
this kind of user role.

Before any of the actions described below can be done, it is expected that appropriate table schemas or entry forms have been created by one or more administrators of the system. This process is described elsewhere.

<!--TODO change elsewhere above to the actual location of where we describe table schema and data entry form creation-->

#### Batch Upload of Data

The user has selected a valid table schema to use and has uploaded the
file to the holding area. This prompts the system to check that the
data in the file match the schema in the database on headers and data
types. If this validation is successful, the system will inform the
user how many rows of data it found and validated. If the user agrees,
the system will write the data into the relevant table and display a
confirmation back to the user. Should the user disagree with the number
of rows, they cancel the upload and take the file away to investigate
why the system can't see the correct number of rows; this investigation
happens outside of Seedcase.
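The header-and-type check described above could look roughly like the sketch below, assuming a table schema expressed as a column-name-to-type mapping. The names are illustrative, not Sprout's implementation.

```python
# Minimal sketch of batch-upload validation: check that the file's
# headers and value types match a table schema, then report the row
# count. Schema representation and names are assumptions.


def validate_batch(rows: list[dict], schema: dict[str, type]) -> int:
    """Check headers and data types; return the number of valid rows."""
    if not rows or set(rows[0]) != set(schema):
        raise ValueError("File headers do not match the table schema")
    for i, row in enumerate(rows):
        for column, expected in schema.items():
            try:
                expected(row[column])  # e.g. int("42"), float("70.5")
            except (TypeError, ValueError):
                raise ValueError(f"Row {i}: {column!r} is not {expected.__name__}")
    return len(rows)
```

The returned row count is what the system would show the user for confirmation before writing the data into the relevant table.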

![Logged In User Who Chooses to Use the Batch Upload Function with Existing Table Schema.](/design/images/user-flow-data-upload.png){#fig-batch-data-entry}

<!--TODO Ensure that the link above will still work once SKB has finished updating the diagrams-->

#### Manual Data Entry: Done in One Session

The user completes all fields in the form and clicks "Save and Submit". This sends
the data to the API layer where it is confirmed as valid, parcelled up
and submitted to the database. The database will then write the data
into a new record in the table (or tables). Once done, the database
will confirm successful entry of the data to the API, which will in
turn send the confirmation back to the user via the frontend.

![Logged In User Who Manually Writes a New Row to the Data
Resource.](/design/images/runtime-manual-data-entry.png){#fig-manual-data-entry}

<!--TODO convert puml file to png so that the link above works-->

#### Manual Data Entry: Done in Multiple Sessions

There may be situations where an approved user is prevented from
completing the data entry form in one session. In that case, it is
beneficial to have the option of saving the data as it is and
returning to the data entry at a later time. Much of the initial
workflow is the same as above, until the user is interrupted and
selects "Save" instead of "Save and Submit". This will send the data
to the API with a flag showing that fields may be incomplete, thus
preventing the API from rejecting the data due to NULL values. The API
will submit the data to the database along with the incomplete flag.
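The difference between "Save" and "Save and Submit" can be sketched as follows. The payload shape and flag name are assumptions for illustration only.

```python
# Illustrative sketch of "Save" vs "Save and Submit": a record saved
# with the incomplete flag may contain missing (None) fields, while a
# submitted record must not. Field and flag names are assumptions.


def save_record(record: dict, submit: bool) -> dict:
    """Return the payload the API would pass to the database."""
    if submit and any(value is None for value in record.values()):
        raise ValueError("Cannot submit: record has empty fields")
    return {"data": record, "incomplete": not submit}
```

A "Save" keeps the record flagged as incomplete so it can be reopened later; a "Save and Submit" is only accepted once every field is filled in.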

When the Data Contributor goes back to the data entry at a later time, they will be
presented with the option of completing any incomplete records as well
as entering new data. If they click on "Complete Records" they are shown
the records that they have started but not submitted. Once they select a
partially completed record the frontend will request the currently
completed items from the database via the API layer before displaying
the entry form with the completed fields.

Once the user has completed more data, they can either click on "Save"
or "Save and Submit". The first option will put them back at the top of
this workflow; the second will send the data back to the API layer for
validation. Once the data is validated, it will be submitted to the
database. The database will then write the data into a new record in
the table (or tables) and update the flag to show the record is
complete. Once done, the database will confirm successful entry of the
data to the API, which will in turn send the confirmation back to the
user via the frontend.

![Logged In User Enters Data Manually in More Than One
Session.](/design/images/runtime-manual-data-update.png){#fig-manual-data-update}

<!--TODO convert puml file to png so that the link above works-->