Park an archived project toolkit!
DISCLAIMERS:
When a project reaches completion, most analysts have folders (or .tar
files) on Biowulf/Helix which contain either:
- raw data, e.g. FASTQ files.
- processed data generated by CCBR pipelines and/or other custom downstream analysis tools.
The analyst can use parkit to park these folders directly on to HPCDME's CCBR_Archive object store vault. A typical project, say ccbrXYZ, can be parked at /CCBR_Archive/GRIDFTP/Project_CCBR-XYZ with collections "Analysis" and "Rawdata".
!!! note The projark command is preferred for archiving CCBR projects.
- On helix or biowulf you can get access to parkit by loading the appropriate conda env:
%> . "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh"
%> conda activate parkit
- The HPC_DME_APIs package needs to be cloned and set up correctly. Run dm_generate_token to successfully generate a token before running parkit.
- The HPC_DM_UTILS environment variable should be preset before calling parkit. It also needs to be passed as an argument to the parkit_folder2hpcdme and parkit_tarball2hpcdme end-to-end workflows.
!!! warning If not on helix or biowulf, you will have to clone the repo and pip install it. Then set up HPC_DME_APIs appropriately.
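The prerequisites above can be checked before the first parkit call. The following is only a sketch: the function name check_parkit_prereqs is hypothetical and not part of parkit, and it assumes that HPC_DM_UTILS points at a directory and that parkit and dm_generate_token resolve as commands (or shell functions) in the current shell.

```bash
# Preflight sketch: confirm the prerequisites listed above before calling parkit.
# check_parkit_prereqs is a hypothetical helper, not part of parkit.
check_parkit_prereqs() {
    local status=0
    local tool

    # HPC_DM_UTILS must be set; assumed here to point at an existing directory.
    if [ -z "${HPC_DM_UTILS:-}" ]; then
        echo "HPC_DM_UTILS is not set" >&2
        status=1
    elif [ ! -d "$HPC_DM_UTILS" ]; then
        echo "HPC_DM_UTILS does not point to a directory: $HPC_DM_UTILS" >&2
        status=1
    fi

    # Both tools must resolve as commands or shell functions in this shell.
    for tool in parkit dm_generate_token; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found" >&2
            status=1
        fi
    done

    return "$status"
}
```

Calling `check_parkit_prereqs && dm_generate_token` would then refuse to continue when something is missing.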
%> parkit --help
usage: parkit [-h] {createtar,createmetadata,createemptycollection,deposittar} ...
parkit subcommands to park data in HPCDME
positional arguments:
{createtar,createmetadata,createemptycollection,deposittar}
Subcommand to run
createtar create tarball(and its filelist) from a project folder.
createmetadata create the metadata.json file required for a tarball (and its filelist)
createemptycollection
creates empty project and analysis collections
deposittar deposit tarball(and filelist) into vault
options:
-h, --help show this help message and exit
- Say you want to archive the /data/CCBR/projects/CCBR-12345 folder to the /CCBR_Archive/GRIDFTP/Project_CCBR-12345 collection on HPC-DME. You can run the following commands sequentially to do this:
# create the tarball
%> parkit createtar --folder /data/CCBR/projects/ccbr_12345
# the above command creates the following files:
# - ccbr_12345.tar
# - ccbr_12345.tar.md5
# - ccbr_12345.tar.filelist
# - ccbr_12345.tar.filelist.md5
# create an empty collection on HPC-DME
%> parkit createemptycollection --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345 --projectdesc "testing" --projecttitle "test project 1"
# the above command creates collections:
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Analysis
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Rawdata
# create required metadata
%> parkit createmetadata --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata then "--collectiontype Rawdata" argument needs to be added to the above commandline
# deposit the tar into HPC-DME
%> parkit deposittar --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is rawdata then "--collectiontype Rawdata" argument needs to be added to the above commandline
# a bunch of extra files are created in the process
%> ls /data/CCBR/projects/ccbr_12345.tar*
/data/CCBR/projects/ccbr_12345.tar /data/CCBR/projects/ccbr_12345.tar.filelist.md5 /data/CCBR/projects/ccbr_12345.tar.md5
/data/CCBR/projects/ccbr_12345.tar.filelist /data/CCBR/projects/ccbr_12345.tar.filelist.metadata.json /data/CCBR/projects/ccbr_12345.tar.metadata.json
# delete the recently parked project folder contents including hidden contents
# (a plain "rm -rf <folder>/*" would skip dotfiles, so find is used instead)
%> find /data/CCBR/projects/CCBR-12345 -mindepth 1 -delete
# copy filelist into the empty project folder for future quick reference
%> cp /data/CCBR/projects/ccbr_12345.tar.filelist /data/CCBR/projects/CCBR-12345/ccbr_12345.tar.filelist
# delete files created by parkit
%> rm -f /data/CCBR/projects/ccbr_12345.tar*
# test results with
%> dm_get_collection /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# Done!
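The four parkit steps in the walkthrough above can be chained in a small wrapper. This is a sketch under stated assumptions: park_project, _run and the DRY_RUN switch are hypothetical helpers, not part of parkit, and the tarball is assumed to be written next to the folder as `<folder>.tar`, as in the walkthrough.

```bash
# Sketch: the createtar -> createemptycollection -> createmetadata -> deposittar
# sequence from the walkthrough, stopping at the first failing step.
# park_project, _run and DRY_RUN are hypothetical, not part of parkit.

_run() {
    # With DRY_RUN=1 the command is printed (for display only, quoting is
    # not preserved) instead of being executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

park_project() {
    local folder="${1%/}" dest="$2" title="$3" desc="$4"
    # Assumption: createtar writes <folder>.tar next to the folder.
    local tarball="${folder}.tar"

    _run parkit createtar --folder "$folder" || return 1
    _run parkit createemptycollection --dest "$dest" \
        --projectdesc "$desc" --projecttitle "$title" || return 1
    _run parkit createmetadata --tarball "$tarball" --dest "$dest" || return 1
    _run parkit deposittar --tarball "$tarball" --dest "$dest" || return 1
}
```

For example, `DRY_RUN=1 park_project /data/CCBR/projects/ccbr_12345 /CCBR_Archive/GRIDFTP/Project_CCBR-12345 "test project 1" "testing"` prints the four commands without running them. For raw data, "--collectiontype Rawdata" would still have to be added to the createmetadata and deposittar calls, as noted in the walkthrough.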
We also have end-to-end slurm-supported folder-to-hpcdme and tarball-to-hpcdme workflows:
- parkit_folder2hpcdme
- parkit_tarball2hpcdme
- projark [recommended for archiving CCBR projects to the GRIDFTP folder under CCBR_Archive]
If run with --executor slurm, these workflows interface with the job scheduler on Biowulf and submit their individual steps as interdependent jobs.
%> parkit_folder2hpcdme --help
usage: parkit_folder2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--folder FOLDER] [--dest DEST] [--projectdesc PROJECTDESC]
[--projecttitle PROJECTTITLE] [--rawdata] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH [--version]
End-to-end parkit: Folder 2 HPCDME
options:
-h, --help show this help message and exit
--restartfrom RESTARTFROM
if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
--executor EXECUTOR slurm or local
--folder FOLDER project folder to archive
--dest DEST vault collection path (Analysis goes under here!)
--projectdesc PROJECTDESC
project description
--projecttitle PROJECTTITLE
project title
--rawdata If tarball is rawdata and needs to go under folder Rawdata
--cleanup post transfer step to delete local files
--hpcdmutilspath HPCDMUTILSPATH
what should be the value of env var HPC_DM_UTILS
--version print version
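An invocation can be assembled from the flags in the help text above. The sketch below is illustrative only: build_folder2hpcdme_args and the F2H_ARGS array are hypothetical names, and the optional restart step is checked against the three values the help text lists for --restartfrom.

```bash
# Sketch: build the argument list for parkit_folder2hpcdme from the documented flags.
# build_folder2hpcdme_args and F2H_ARGS are hypothetical, not part of parkit.
build_folder2hpcdme_args() {
    local folder="$1" dest="$2" title="$3" desc="$4" restart="${5:-}"

    # --hpcdmutilspath is required, so refuse to continue without HPC_DM_UTILS.
    if [ -z "${HPC_DM_UTILS:-}" ]; then
        echo "HPC_DM_UTILS must be set" >&2
        return 1
    fi

    F2H_ARGS=(
        --folder "$folder"
        --dest "$dest"
        --projecttitle "$title"
        --projectdesc "$desc"
        --executor slurm
        --hpcdmutilspath "$HPC_DM_UTILS"
    )

    # Optionally resume from one of the steps named in the help text.
    case "$restart" in
        "") ;;
        createemptycollection|createmetadata|deposittar)
            F2H_ARGS+=(--restartfrom "$restart") ;;
        *)
            echo "unknown restart step: $restart" >&2
            return 1 ;;
    esac
}

# Usage (not executed here):
#   build_folder2hpcdme_args /data/CCBR/projects/ccbr_12345 \
#       /CCBR_Archive/GRIDFTP/Project_CCBR-12345 "test project 1" "testing" deposittar
#   parkit_folder2hpcdme "${F2H_ARGS[@]}"
```

Keeping the arguments in an array preserves values that contain spaces, such as the project title, when they are passed on to the workflow.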
%> parkit_tarball2hpcdme --help
usage: parkit_tarball2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--tarball TARBALL] [--dest DEST]
[--projectdesc PROJECTDESC] [--projecttitle PROJECTTITLE] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH
[--version]
End-to-end parkit: Tarball 2 HPCDME
options:
-h, --help show this help message and exit
--restartfrom RESTARTFROM
if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
--executor EXECUTOR slurm or local
--tarball TARBALL project tarball to archive
--dest DEST vault collection path (Analysis goes under here!)
--projectdesc PROJECTDESC
project description
--projecttitle PROJECTTITLE
project title
--cleanup post transfer step to delete local files
--hpcdmutilspath HPCDMUTILSPATH
what should be the value of env var HPC_DM_UTILS
--version print version
%> projark --help
usage: projark [-h] --folder FOLDER --projectnumber PROJECTNUMBER
[--executor EXECUTOR] [--rawdata] [--cleanup]
Wrapper for folder2hpcdme for quick CCBR project archiving!
options:
-h, --help show this help message and exit
--folder FOLDER Input folder path to archive
--projectnumber PROJECTNUMBER
CCBR project number.. destination will be
/CCBR_Archive/GRIDFTP/Project_CCBR-<projectnumber>
--executor EXECUTOR slurm or local
--rawdata If tarball is rawdata and needs to go under folder
Rawdata
--cleanup post transfer step to delete local files
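As a rough illustration of the projark help text above, the destination collection can be derived from the project number and inspected after the transfer. projark_dest is a hypothetical helper that only mirrors the destination pattern stated in the help text; the folder path and project number are placeholders.

```bash
# Sketch: derive the vault destination that projark is documented to use.
# projark_dest is a hypothetical helper, not part of parkit.
projark_dest() {
    printf '/CCBR_Archive/GRIDFTP/Project_CCBR-%s\n' "$1"
}

# Usage (not executed here):
#   projark --folder /data/CCBR/projects/ccbr_12345 --projectnumber 12345 --executor slurm
#   dm_get_collection "$(projark_dest 12345)"
```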