parkit 🅿️ 🚙

Park an archived project toolkit!

DISCLAIMERS:

Background

When a project reaches completion, most analysts have folders (or .tar files) on Biowulf/Helix that contain either:

  • raw data, e.g. FASTQ files.
  • processed data generated by CCBR pipelines and/or custom downstream analysis tools.

The analyst can use parkit to park these folders directly onto HPCDME's CCBR_Archive object store vault. A typical project, say ccbrXYZ, can be parked at /CCBR_Archive/GRIDFTP/Project_CCBR-XYZ with collections "Analysis" and "Rawdata".

!!! note The projark command is preferred for CCBR project archiving.

Prerequisites:

  • On Helix or Biowulf, you can access parkit by activating the appropriate conda environment:
%> . "/data/CCBR_Pipeliner/db/PipeDB/Conda/etc/profile.d/conda.sh"
%> conda activate parkit
  • The HPC_DME_APIs package needs to be cloned and set up correctly. Run dm_generate_token to generate a token before running parkit.

  • The HPC_DM_UTILS environment variable must be set before calling parkit. Its value must also be passed as an argument (--hpcdmutilspath) to the parkit_folder2hpcdme and parkit_tarball2hpcdme end-to-end workflows.

!!! warning If you are not on Helix or Biowulf, you will have to clone the repo and pip install it, then set up HPC_DME_APIs appropriately.
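
The prerequisites above can be checked before a run with a small preflight function. This is only a sketch: `parkit_preflight` is a hypothetical helper (not part of parkit), and it assumes that `parkit` and `dm_generate_token` are callable from the current shell once the conda environment and HPC_DME_APIs are set up.

```shell
# parkit_preflight: hypothetical helper, not shipped with parkit.
# Returns 0 only if every prerequisite listed above appears to be satisfied.
parkit_preflight() {
    _pf_status=0
    # HPC_DM_UTILS must be set before calling parkit.
    if [ -z "${HPC_DM_UTILS:-}" ]; then
        echo "HPC_DM_UTILS is not set" >&2
        _pf_status=1
    fi
    # command -v succeeds for executables and for shell functions alike.
    for _pf_tool in parkit dm_generate_token; do
        if ! command -v "$_pf_tool" >/dev/null 2>&1; then
            echo "$_pf_tool is not available in this shell" >&2
            _pf_status=1
        fi
    done
    return "$_pf_status"
}

# Example use (not executed here):
#   parkit_preflight && parkit --help
```

The function only checks that the tools are reachable; it does not confirm that a valid token has been generated.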

Usage:

%> parkit --help
usage: parkit [-h] {createtar,createmetadata,createemptycollection,deposittar} ...

parkit subcommands to park data in HPCDME

positional arguments:
  {createtar,createmetadata,createemptycollection,deposittar}
                        Subcommand to run
    createtar           create tarball(and its filelist) from a project folder.
    createmetadata      create the metadata.json file required for a tarball (and its filelist)
    createemptycollection
                        creates empty project and analysis collections
    deposittar          deposit tarball(and filelist) into vault

options:
  -h, --help            show this help message and exit

Example:

  • Say you want to archive the /data/CCBR/projects/ccbr_12345 folder to the /CCBR_Archive/GRIDFTP/Project_CCBR-12345 collection on HPC-DME.
  • You can run the following commands sequentially to do this:
# create the tarball
%> parkit createtar --folder /data/CCBR/projects/ccbr_12345
# the above command creates the following files:
# - ccbr_12345.tar
# - ccbr_12345.tar.md5
# - ccbr_12345.tar.filelist
# - ccbr_12345.tar.filelist.md5

# create an empty collection on HPC-DME
%> parkit createemptycollection --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345 --projectdesc "testing" --projecttitle "test project 1"
# the above command creates collections:
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Analysis
# - /CCBR_Archive/GRIDFTP/Project_CCBR-12345/Rawdata

# create required metadata
%> parkit createmetadata --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is raw data, add the "--collectiontype Rawdata" argument to the above command line

# deposit the tar into HPC-DME
%> parkit deposittar --tarball /data/CCBR/projects/ccbr_12345.tar --dest /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# if ccbr_12345.tar is raw data, add the "--collectiontype Rawdata" argument to the above command line

# several extra files are created in the process
%> ls /data/CCBR/projects/ccbr_12345.tar*
/data/CCBR/projects/ccbr_12345.tar           /data/CCBR/projects/ccbr_12345.tar.filelist.md5            /data/CCBR/projects/ccbr_12345.tar.md5
/data/CCBR/projects/ccbr_12345.tar.filelist  /data/CCBR/projects/ccbr_12345.tar.filelist.metadata.json  /data/CCBR/projects/ccbr_12345.tar.metadata.json

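# delete the contents of the recently parked project folder, including hidden files
# (a plain "rm -rf <folder>/*" skips dotfiles because the glob does not match them)
%> find /data/CCBR/projects/ccbr_12345 -mindepth 1 -delete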

# copy the filelist into the empty project folder for quick future reference
%> cp /data/CCBR/projects/ccbr_12345.tar.filelist /data/CCBR/projects/ccbr_12345/ccbr_12345.tar.filelist

# delete files created by parkit
%> rm -f /data/CCBR/projects/ccbr_12345.tar*

# test results with
%> dm_get_collection /CCBR_Archive/GRIDFTP/Project_CCBR-12345
# Done!
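
Since the cleanup steps above delete local data, it may be worth confirming that the tarball still matches its recorded checksum before removing anything. The sketch below uses a hypothetical helper, `verify_tar_md5`, and assumes that the first whitespace-delimited field of the .tar.md5 file is the hex digest (the layout `md5sum` writes); the actual layout of the file produced by createtar is not documented here, so treat this as an assumption.

```shell
# verify_tar_md5: hypothetical helper, not part of parkit.
# Assumes <tarball>.md5 stores the hex digest as its first field.
verify_tar_md5() {
    _vt_tarball=$1
    _vt_md5file="${_vt_tarball}.md5"
    if [ ! -f "$_vt_tarball" ] || [ ! -f "$_vt_md5file" ]; then
        echo "missing $_vt_tarball or $_vt_md5file" >&2
        return 2
    fi
    # Recorded digest: first field of the first line of the .md5 file.
    _vt_expected=$(awk 'NR==1 {print $1}' "$_vt_md5file")
    # Fresh digest computed from the tarball as it is now.
    _vt_actual=$(md5sum "$_vt_tarball" | awk '{print $1}')
    if [ "$_vt_expected" = "$_vt_actual" ]; then
        echo "OK: $_vt_tarball"
    else
        echo "MISMATCH: $_vt_tarball" >&2
        return 1
    fi
}

# Example use (not executed here):
#   verify_tar_md5 /data/CCBR/projects/ccbr_12345.tar
```

This only checks the local copy; it says nothing about the object stored in the vault.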

We also have end-to-end slurm-supported folder-to-hpcdme and tarball-to-hpcdme workflows:

  • parkit_folder2hpcdme
  • parkit_tarball2hpcdme
  • projark [ recommended for archiving CCBR projects to the GRIDFTP folder under CCBR_Archive ]

When run with --executor slurm, these workflows interface with the job scheduler on Biowulf and submit the individual steps of the E2E workflow as interdependent jobs.

parkit_folder2hpcdme

%> parkit_folder2hpcdme --help
usage: parkit_folder2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--folder FOLDER] [--dest DEST] [--projectdesc PROJECTDESC]
                            [--projecttitle PROJECTTITLE] [--rawdata] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH [--version]

End-to-end parkit: Folder 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --folder FOLDER       project folder to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --rawdata             If tarball is rawdata and needs to go under folder Rawdata
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version
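
As an illustration of how the options above fit together, the sketch below assembles a slurm-backed invocation. `run_folder2hpcdme` and the `DRY_RUN` variable are hypothetical (not part of parkit); the flags themselves are taken from the help text, and the folder, destination, title and description are placeholders reused from the earlier example.

```shell
# run_folder2hpcdme: hypothetical wrapper, not part of parkit.
# Prints the command by default; set DRY_RUN=0 to actually execute it.
run_folder2hpcdme() {
    _rf_folder=$1
    _rf_dest=$2
    if [ -z "${HPC_DM_UTILS:-}" ]; then
        echo "HPC_DM_UTILS is not set" >&2
        return 1
    fi
    # Every flag below appears in the parkit_folder2hpcdme help text.
    set -- parkit_folder2hpcdme \
        --folder "$_rf_folder" \
        --dest "$_rf_dest" \
        --projectdesc "testing" \
        --projecttitle "test project 1" \
        --executor slurm \
        --hpcdmutilspath "$HPC_DM_UTILS"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Printed for inspection only; quoting of multi-word values is lost.
        echo "$*"
    else
        "$@"
    fi
}

# Example use (not executed here):
#   run_folder2hpcdme /data/CCBR/projects/ccbr_12345 /CCBR_Archive/GRIDFTP/Project_CCBR-12345
```

Add --rawdata or --cleanup to the argument list if the run calls for them.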

parkit_tarball2hpcdme

%> parkit_tarball2hpcdme --help
usage: parkit_tarball2hpcdme [-h] [--restartfrom RESTARTFROM] [--executor EXECUTOR] [--tarball TARBALL] [--dest DEST]
                             [--projectdesc PROJECTDESC] [--projecttitle PROJECTTITLE] [--cleanup] --hpcdmutilspath HPCDMUTILSPATH
                             [--version]

End-to-end parkit: Tarball 2 HPCDME

options:
  -h, --help            show this help message and exit
  --restartfrom RESTARTFROM
                        if restarting then restart from this step. Options are: createemptycollection, createmetadata, deposittar
  --executor EXECUTOR   slurm or local
  --tarball TARBALL     project tarball to archive
  --dest DEST           vault collection path (Analysis goes under here!)
  --projectdesc PROJECTDESC
                        project description
  --projecttitle PROJECTTITLE
                        project title
  --cleanup             post transfer step to delete local files
  --hpcdmutilspath HPCDMUTILSPATH
                        what should be the value of env var HPC_DM_UTILS
  --version             print version
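
A restart with --restartfrom accepts only the step names listed in the help text. The sketch below validates the step before printing the command it would run; `restart_tarball2hpcdme` is a hypothetical wrapper (not part of parkit), and whether other options such as --projecttitle must be repeated on restart is not stated here, so they are left out.

```shell
# restart_tarball2hpcdme: hypothetical wrapper, not part of parkit.
# Validates the restart step, then prints (does not run) the command.
restart_tarball2hpcdme() {
    _rt_step=$1
    _rt_tarball=$2
    _rt_dest=$3
    # Allowed values come from the --restartfrom help text.
    case "$_rt_step" in
        createemptycollection|createmetadata|deposittar) ;;
        *)
            echo "unknown restart step: $_rt_step" >&2
            return 1
            ;;
    esac
    if [ -z "${HPC_DM_UTILS:-}" ]; then
        echo "HPC_DM_UTILS is not set" >&2
        return 1
    fi
    echo parkit_tarball2hpcdme \
        --restartfrom "$_rt_step" \
        --tarball "$_rt_tarball" \
        --dest "$_rt_dest" \
        --hpcdmutilspath "$HPC_DM_UTILS"
}

# Example use (not executed here):
#   restart_tarball2hpcdme deposittar /data/CCBR/projects/ccbr_12345.tar /CCBR_Archive/GRIDFTP/Project_CCBR-12345
```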

projark

%> projark --help
usage: projark [-h] --folder FOLDER --projectnumber PROJECTNUMBER
               [--executor EXECUTOR] [--rawdata] [--cleanup]

Wrapper for folder2hpcdme for quick CCBR project archiving!

options:
  -h, --help            show this help message and exit
  --folder FOLDER       Input folder path to archive
  --projectnumber PROJECTNUMBER
                        CCBR project number.. destination will be
                        /CCBR_Archive/GRIDFTP/Project_CCBR-<projectnumber>
  --executor EXECUTOR   slurm or local
  --rawdata             If tarball is rawdata and needs to go under folder
                        Rawdata
  --cleanup             post transfer step to delete local files
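
To make the destination rule in the help text concrete, the sketch below derives the vault path from a project number. `projark_dest` is a hypothetical helper (not part of parkit), and the folder path in the commented command is a placeholder reused from the earlier example.

```shell
# projark_dest: hypothetical helper mirroring the destination pattern
# /CCBR_Archive/GRIDFTP/Project_CCBR-<projectnumber> from the help text.
projark_dest() {
    printf '/CCBR_Archive/GRIDFTP/Project_CCBR-%s\n' "$1"
}

projark_dest 12345
# prints: /CCBR_Archive/GRIDFTP/Project_CCBR-12345

# The matching archive command would look like this (not executed here):
#   projark --folder /data/CCBR/projects/ccbr_12345 --projectnumber 12345 --executor slurm
```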